[ https://issues.apache.org/jira/browse/CASSANDRA-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aleksey Yeschenko resolved CASSANDRA-8098.
------------------------------------------
    Resolution: Won't Fix

Hadoop input/output formats are likely to move off-tree soon, and as such we aren't going to allocate any resources to new Hadoop-related functionality. If you come up with a 3.x patch, however, feel free to reopen the ticket.

> Allow CqlInputFormat to be restricted to more than one data-center
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-8098
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8098
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: mck
>            Assignee: mck
>
> Today, using CqlInputFormat, it's only possible to
>  - enforce data-locality to one specific data-center, or
>  - disable it by changing CL from LOCAL_ONE to ONE.
> We need a way to enforce data-locality to specific *data-centers*, and would
> like to contribute a solution.
> Suggested ideas
>  - CqlInputFormat (gently) calls describeLocalRing against all the listed
> connection addresses and merges the results into one masterRangeNodes list, or
>  - changing the signature of describeLocalRing(..) to describeRings(String
> keyspace, String[] dc) and having the job specify which DCs it will be
> running within.
> *Long description*
> A lot has changed since CASSANDRA-2388 that has made life a lot easier with
> integrating c* and hadoop, for example: CqlInputFormat, CL.LOCAL_ONE,
> LimitedLocalNodeFirstLocalBalancingPolicy, vnodes, and describe_local_ring.
> When using CqlInputFormat, if you don't want to be stuck within
> datacenter-locality you can for example change the consistency level from
> LOCAL_ONE to ONE. That's great, but describe_local_ring + CL.LOCAL_ONE in its
> current implementation isn't enough for us. We have multiple datacenters for
> offline, multiple for online, because we still want the availability
> advantages that come from aligning virtual datacenters to physical
> datacenters for the offline stuff too. That is, using hadoop for aggregation
> purposes on top of c* doesn't always imply one can settle for a CP solution.
> Some of our jobs have their own InputFormat implementation that uses
> describe_ring, LOCAL_ONE, and data with replicas only in the offline
> datacenters. Works very well, except the last point kinda sucks because we
> have online clients that want to read this data and have to then do so
> through nodes in the offline datacenters. Underlying performance
> improvements, e.g. cross_node_timeout and speculative requests, have helped,
> but there's still the need to separate online and offline. If we wanted to
> push replicas out onto the online nodes, I think the best approach for us
> would be to filter out those splits/locations in getRangeMap(..).
> Back to this issue: we also have jobs using CqlInputFormat. Specifying
> multiple client input addresses doesn't help take advantage of the multiple
> offline datacenters because the Cassandra.Client only makes one call to
> describe_local_ring, and StorageService.describeLocalRing(..) only checks
> against its own address. It would work to have either a) CqlInputFormat call
> describeLocalRing against all the listed connection addresses and merge the
> results into one masterRangeNodes list, or b) something along the lines of
> changing the signature of describeLocalRing(..) to describeRings(String
> keyspace, String[] dc) and having the job specify which DCs it will be
> running within.
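
Option (a) from the ticket, calling describeLocalRing against every listed connection address and merging the results into one masterRangeNodes list, could look roughly like the Java sketch below. It is not taken from any Cassandra patch: `RangeNodes`, `LocalRingSource`, and `mergeLocalRings` are hypothetical names introduced for illustration, and the sketch assumes every data-center reports the same token ranges, so that results can be keyed by (start token, end token) and their endpoints unioned.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class MergedRingSketch {

    /** Hypothetical stand-in for one token range and the replicas reported for it. */
    static final class RangeNodes {
        final String startToken;
        final String endToken;
        final Set<String> endpoints;

        RangeNodes(String startToken, String endToken, Set<String> endpoints) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.endpoints = endpoints;
        }
    }

    /** Hypothetical abstraction over one describe_local_ring call made through a given address. */
    interface LocalRingSource {
        List<RangeNodes> describeLocalRing(String address, String keyspace);
    }

    /**
     * Asks each address for its local ring and unions the endpoints per token range.
     * Assumption: all data-centers report identical (start, end) token ranges.
     */
    static List<RangeNodes> mergeLocalRings(LocalRingSource source,
                                            List<String> addresses,
                                            String keyspace) {
        Map<String, RangeNodes> merged = new LinkedHashMap<>();
        for (String address : addresses) {
            for (RangeNodes range : source.describeLocalRing(address, keyspace)) {
                String key = range.startToken + ":" + range.endToken;
                RangeNodes existing = merged.get(key);
                if (existing == null) {
                    // Copy the endpoints so later merges never mutate the caller's set.
                    merged.put(key, new RangeNodes(range.startToken, range.endToken,
                                                   new LinkedHashSet<>(range.endpoints)));
                } else {
                    existing.endpoints.addAll(range.endpoints);
                }
            }
        }
        return new ArrayList<>(merged.values());
    }
}
```

With one address per offline data-center, each merged range would then carry replicas from all of those data-centers, which is what split location assignment would need. Option (b), a describeRings(String keyspace, String[] dc) call, would move the same union to the server side.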