[jira] [Commented] (CASSANDRA-9231) Support Routing Key as part of Partition Key

Sylvain Lebresne (JIRA) Fri, 08 May 2015 02:29:56 -0700

    [ 
https://issues.apache.org/jira/browse/CASSANDRA-9231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14534202#comment-14534202
 ]


Sylvain Lebresne commented on CASSANDRA-9231:
---------------------------------------------

What I'm talking about is basically the idea of CASSANDRA-5054. Or to put it 
another way, we could use a function like:
{noformat}
CREATE FUNCTION myTokenFct(a int, b int) RETURNS bigint AS 
$$
    long high = murmur3(a);
    long low = murmur3(b);
    return (high & 0xFFFFFFFF00000000L) | (low & 0x00000000FFFFFFFFL);
$$;
{noformat}
The goal is to make it likely that partitions with the same value for {{a}} 
end up on a small number of nodes, without forcing everything onto the same 
node (the latter having a fair amount of foot-shooting potential). But that's 
really just an example. You could also imagine having a specific table that is 
"ordered" (in a predictable way) without having to use 
{{ByteOrderedPartitioner}} for the whole cluster:
{noformat}
CREATE FUNCTION myOrderedTokenFct(a bigint) RETURNS bigint AS 'return a';
CREATE TABLE t (
   a bigint PRIMARY KEY,
   b text,
   c text
) with tokenizer=myOrderedTokenFct;
{noformat}

Basically, this gets you very close to a per-table partitioner. The 
cluster-wide partitioner would just define the "domain" of the tokens and how 
they sort, but the actual computation would be per-table. And all this for 
very little change to the syntax and barely more code complexity than the 
"routing key" idea.
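
To make the bit composition in {{myTokenFct}} concrete, here is a minimal Java 
sketch. It is only an illustration under stated assumptions: {{mix64}} is a 
stand-in 64-bit mixer (a SplitMix64-style finalizer), not Cassandra's Murmur3, 
and the class and method names are hypothetical.

```java
final class ComposedTokenSketch {
    // Stand-in 64-bit mixer (SplitMix64 finalizer). Assumption: any well-mixed
    // 64-bit hash would do here; this is NOT Cassandra's Murmur3.
    static long mix64(long x) {
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Upper 32 bits come from hash(a), lower 32 bits from hash(b).
    static long composedToken(int a, int b) {
        long high = mix64(a);
        long low = mix64(b);
        return (high & 0xFFFFFFFF00000000L) | (low & 0x00000000FFFFFFFFL);
    }

    public static void main(String[] args) {
        long t1 = composedToken(42, 1);
        long t2 = composedToken(42, 2);
        // Same a: both tokens share their upper 32 bits, so they fall into
        // one contiguous token range of width 2^32.
        System.out.println((t1 >>> 32) == (t2 >>> 32)); // prints "true"
    }
}
```

Because tokens with the same {{a}} differ only in their low 32 bits, they sit 
close together on the ring, but a node boundary can still fall inside that 
range, which is why co-location is likely rather than guaranteed.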

Of course, this will be an advanced feature that people should use at their own 
risk. But that's true of the "routing key" idea too: we'd better label it as 
an advanced feature or I'm certain people will misuse it and shoot themselves 
in the foot more often than not. This is also why I'm not too worried about the 
driver side: it's simple to say that if you use a custom token function, which 
will be rare in the first place, you have to provide it to the driver too to 
get token awareness. That is a downside, but a very small one in practice 
given the context.
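
As a rough illustration of what "providing the function to the driver" could 
mean, the sketch below models a token ring with a pluggable token function. 
This is not any real driver's API: {{TokenAwareRoutingSketch}}, {{addNode}} and 
{{ownerOf}} are hypothetical names, and replication is ignored so that each 
token maps to a single owner.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongUnaryOperator;

final class TokenAwareRoutingSketch {
    // Ring: node token -> node name. A node owns the range (previous token, its token].
    private final TreeMap<Long, String> ring = new TreeMap<>();
    // The per-table token function the application hands to the "driver".
    private final LongUnaryOperator tokenFunction;

    TokenAwareRoutingSketch(LongUnaryOperator tokenFunction) {
        this.tokenFunction = tokenFunction;
    }

    void addNode(long token, String node) {
        ring.put(token, node);
    }

    // Owner is the first node whose token is >= the key's token, wrapping
    // around to the lowest token. Assumes the ring is non-empty.
    String ownerOf(long key) {
        long token = tokenFunction.applyAsLong(key);
        Map.Entry<Long, String> entry = ring.ceilingEntry(token);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    public static void main(String[] args) {
        // Identity token function, in the spirit of myOrderedTokenFct.
        TokenAwareRoutingSketch routing = new TokenAwareRoutingSketch(k -> k);
        routing.addNode(-100L, "node1");
        routing.addNode(0L, "node2");
        routing.addNode(100L, "node3");
        System.out.println(routing.ownerOf(-150L)); // prints "node1"
        System.out.println(routing.ownerOf(50L));   // prints "node3"
        System.out.println(routing.ownerOf(101L));  // prints "node1" (wraps around)
    }
}
```

The only table-specific piece is the function passed to the constructor; the 
ring lookup itself does not change.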

Perhaps more importantly, I think the function idea is conceptually *simpler* 
than the routing key idea. All that you basically have to say is that we allow 
you to define the {{token}} function on a per-table basis, the exact same 
function that already exists and can be used in {{SELECT}}.

The routing key concept (or whatever name we would pick) is, imo, more 
confusing. You have to explain that on top of the _primary key_ having a 
subpart that is the _partition key_, the latter now also has a subpart that is 
the _routing key_. And how do you define what the _partition key_ is now in 
simple terms? Well, I don't know, because once you have a routing key that is 
different from the partition key, the partition key starts to be kind of an 
implementation detail. It's the "thing" that doesn't really determine where 
the row is distributed, but is not part of the clustering so you can't query it 
like a clustering column because ... because?

Honestly, allowing a custom {{token}} function per table is 1) more powerful 
and 2) imo much easier to explain conceptually, and this without muddying 
existing concepts. So I'm a -1 on the routing key concept unless it's shown 
that the custom {{token}} function idea doesn't work, is substantially more 
complex to implement or has fundamental flaws I have missed. I would hate to 
add the routing key idea only to realize that some other user has a clever 
"routing" idea that is just not handled by the routing key (and then have to 
add yet another custom concept).
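
One way to read the "more powerful" claim: the routing key from the issue 
description, {{PRIMARY KEY (([a], b), c)}}, looks like the special case of a 
token function that simply ignores {{b}}. The Java sketch below illustrates 
that reading only; {{mix64}} is again a stand-in mixer rather than Murmur3, 
and all names are hypothetical.

```java
final class RoutingKeyAsTokenSketch {
    // Stand-in 64-bit mixer (SplitMix64 finalizer); NOT Cassandra's Murmur3.
    static long mix64(long x) {
        x = (x ^ (x >>> 30)) * 0xBF58476D1CE4E5B9L;
        x = (x ^ (x >>> 27)) * 0x94D049BB133111EBL;
        return x ^ (x >>> 31);
    }

    // Emulates routing key [a] within partition key (a, b): b still
    // identifies the partition but plays no part in the token.
    static long routingKeyToken(int a, int b) {
        return mix64(a);
    }

    public static void main(String[] args) {
        // Partitions (7, 1) and (7, 2) get the same token, hence the same node.
        System.out.println(routingKeyToken(7, 1) == routingKeyToken(7, 2)); // prints "true"
    }
}
```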

bq. the distinct concept of "token" (which is more an implementation detail, 
IMO)

Your opinions are your own, but the "token" is most definitely *not* an 
implementation detail since 1) we have a {{token}} function in CQL to compute 
it and 2) we reference it all the time in the documentation, have scores of 
options that mention it, it's exposed by drivers, etc. Actually, the fact 
that we would use the token concept rather than adding a new custom one is part 
of why I'm convinced it's conceptually simpler: everyone who knows Cassandra 
knows about tokens.


> Support Routing Key as part of Partition Key
> --------------------------------------------
>
>                 Key: CASSANDRA-9231
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-9231
>             Project: Cassandra
>          Issue Type: Wish
>          Components: Core
>            Reporter: Matthias Broecheler
>             Fix For: 3.x
>
>
> Provide support for sub-dividing the partition key into a routing key and a 
> non-routing key component. Currently, all columns that make up the partition 
> key of the primary key are also routing keys, i.e. they determine which nodes 
> store the data. This proposal would give the data modeler the ability to 
> designate only a subset of the columns that comprise the partition key to be 
> routing keys. The non-routing key columns of the partition key identify the 
> partition but are not used to determine where to store the data.
> Consider the following example table definition:
> CREATE TABLE foo (
>   a int,
>   b int,
>   c int,
>   d int,
>   PRIMARY KEY  (([a], b), c ) );
> (a,b) is the partition key, c is the clustering key, and d is just a column. 
> In addition, the square brackets identify the routing key as column a. This 
> means that only the value of column a is used to determine the node for data 
> placement (i.e. only the value of column a is murmur3 hashed to compute the 
> token). In addition, column b is needed to identify the partition but does 
> not influence the placement.
> This has the benefit that all rows with the same routing key (but potentially 
> different non-routing key columns of the partition key) are stored on the 
> same node and that knowledge of such co-locality can be exploited by 
> applications built on top of Cassandra.
> Currently, the only way to achieve co-locality is within a partition. 
> However, this approach has the limitations that: a) there are theoretical and 
> (more importantly) practical limitations on the size of a partition and b) 
> rows within a partition are ordered and an index is built to exploit such 
> ordering. For large partitions that overhead is significant if ordering isn't 
> needed.
> In other words, routing keys afford a simple means to achieve scalable 
> node-level co-locality without ordering while clustering keys afford 
> page-level co-locality with ordering. As such, they address different 
> co-locality needs giving the data modeler the flexibility to choose what is 
> needed for their application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
