[ https://issues.apache.org/jira/browse/CASSANDRA-8098?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Aleksey Yeschenko resolved CASSANDRA-8098.
------------------------------------------
    Resolution: Won't Fix

Hadoop input/output formats are likely to move off-tree soon, and as such we 
aren't going to allocate any resources to new Hadoop-related functionality.

If you come up with a 3.x patch, however, feel free to reopen the ticket.

> Allow CqlInputFormat to be restricted to more than one data-center
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-8098
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8098
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: mck
>            Assignee: mck
>
> Today, using CqlInputFormat, it's only possible to 
>  - enforce data-locality to one specific data-center, or
>  - disable it by changing CL from LOCAL_ONE to ONE.
> We need a way to enforce data-locality to specific *data-centers*, and would 
> like to contribute a solution.
> Suggested ideas
>  - CqlInputFormat (gently) calls describeLocalRing against all the listed 
> connection addresses and merges the results into one masterRangeNodes list, or 
>  - changing the signature of describeLocalRing(..) to describeRings(String 
> keyspace, String[] dc) and having the job specify which DCs it will be 
> running within.
> *Long description*
> A lot has changed since CASSANDRA-2388 that has made integrating C* and 
> Hadoop a lot easier, for example: CqlInputFormat, CL.LOCAL_ONE, 
> LimitedLocalNodeFirstLocalBalancingPolicy, vnodes, and describe_local_ring.
> When using CqlInputFormat, if you don't want to be stuck within 
> datacenter-locality you can for example change the consistency level from 
> LOCAL_ONE to ONE. That's great, but describe_local_ring + CL.LOCAL_ONE in its 
> current implementation isn't enough for us. We have multiple datacenters for 
> offline, multiple for online, because we still want the availability 
> advantages that come from aligning virtual datacenters to physical 
> datacenters for the offline stuff too. That is, using Hadoop for aggregation 
> purposes on top of C* doesn't always imply one can settle for a CP solution.
> Some of our jobs have their own InputFormat implementation that uses 
> describe_ring, LOCAL_ONE, and data replicated only in the offline 
> datacenters. This works very well, except that the last point kinda sucks 
> because we have online clients that want to read this data and then have to 
> do so through nodes in the offline datacenters. Underlying performance 
> improvements, e.g. cross_node_timeout and speculative requests, have helped, 
> but there's still the need to separate online and offline. If we wanted to 
> push replicas out onto the online nodes, I think the best approach for us 
> would be to filter out those splits/locations in getRangeMap(..).
> Back to this issue: we also have jobs using CqlInputFormat. Specifying 
> multiple client input addresses doesn't help take advantage of the multiple 
> offline datacenters because the Cassandra.Client only makes one call to 
> describe_local_ring, and StorageService.describeLocalRing(..) only checks 
> against its own address. It would work to have either a) CqlInputFormat call 
> describeLocalRing against all the listed connection addresses and merge the 
> results into one masterRangeNodes list, or b) something along the lines of 
> changing the signature of describeLocalRing(..) to describeRings(String 
> keyspace, String[] dc) and having the job specify which DCs it will be 
> running within.
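
For illustration only, option a) from the description could be sketched
roughly as below. This is not the Cassandra API: RingMerge, Range, and
mergeRings are hypothetical names, and Range is a stand-in for the Thrift
TokenRange that describe_local_ring returns. The sketch assumes that each
per-datacenter call reports the same token intervals and differs only in the
endpoints it lists, so merging reduces to a union of endpoints per interval.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class RingMerge {
    /** Hypothetical stand-in for a Thrift TokenRange: a token interval plus its replicas. */
    static final class Range {
        public final String startToken;
        public final String endToken;
        public final List<String> endpoints;

        public Range(String startToken, String endToken, List<String> endpoints) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.endpoints = endpoints;
        }
    }

    /**
     * Merges the rings reported by several datacenters into one list.
     * Ranges with the same (start, end) interval are combined and their
     * endpoints unioned, keeping first-seen order and dropping duplicates.
     */
    static List<Range> mergeRings(List<List<Range>> rings) {
        // Key is the (start, end) pair; LinkedHashMap keeps first-seen interval order.
        Map<List<String>, Set<String>> byInterval = new LinkedHashMap<>();
        for (List<Range> ring : rings) {
            for (Range r : ring) {
                byInterval
                    .computeIfAbsent(Arrays.asList(r.startToken, r.endToken),
                                     k -> new LinkedHashSet<>())
                    .addAll(r.endpoints);
            }
        }
        List<Range> merged = new ArrayList<>();
        for (Map.Entry<List<String>, Set<String>> e : byInterval.entrySet()) {
            merged.add(new Range(e.getKey().get(0), e.getKey().get(1),
                                 new ArrayList<>(e.getValue())));
        }
        return merged;
    }
}
```

In a real input format, each inner list would presumably come from one
describe_local_ring call per configured connection address, and the merged
list would take the place of the single-datacenter masterRangeNodes list.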



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
