[ https://issues.apache.org/jira/browse/CALCITE-2141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17030081#comment-17030081 ]

Christian Beikov commented on CALCITE-2141:
-------------------------------------------

I thought of something similar for querying, i.e. introduce a sub-table per 
partition and combine those into a view through UNION ALL, like Julian 
proposed. I think that's a pretty simple and straightforward implementation, 
but it depends on the optimizer to push down predicates and to remove 
unnecessary set operands, which is great, as implementing these optimizations 
will help in general.
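
To make that concrete, here is a minimal sketch of the read side, assuming 
each sub-table holds exactly one partition (the sometable_1/sometable_2 names 
follow the update example below):

{code:sql}
-- One sub-table per partition, combined into a view.
-- The partition value is a constant within each operand.
CREATE VIEW sometable AS
SELECT t.*, 1 AS partition FROM sometable_1 t
UNION ALL
SELECT t.*, 2 AS partition FROM sometable_2 t;

-- After predicate push-down, MOD(1, 2) = 0 folds to FALSE, so the
-- optimizer can drop the sometable_1 operand from the plan entirely.
SELECT * FROM sometable WHERE MOD(partition, 2) = 0;
{code}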

If your student would also like to work on the writing part, it would be 
awesome if writing could use the same technique, i.e. implement writable views. 
Internally, the expansion could be represented through writable CTEs.

{{UPDATE sometable t SET t.col = t.col || t.partition WHERE MOD(t.partition, 
2) = 0}} would be rewritten to something like the following, where each 
branch's partition predicate becomes a constant that the optimizer can fold:

 
{code:sql}
WITH update1 AS (
  -- Partition 1: MOD(1, 2) = 0 folds to FALSE, so this branch
  -- updates no rows and can be removed as a set operand.
  UPDATE sometable_1 t SET t.col = t.col || t.partition
  WHERE MOD(1, 2) = 0
  RETURNING *
), update2 AS (
  -- Partition 2: MOD(2, 2) = 0 folds to TRUE, so the predicate
  -- disappears and every row of this partition is updated.
  UPDATE sometable_2 t SET t.col = t.col || t.partition
  WHERE MOD(2, 2) = 0
  RETURNING *
)
-- Combine the per-partition update counts into a single count.
SELECT SUM(t.c) FROM (
  SELECT COUNT(*) c FROM update1 u1
  UNION ALL
  SELECT COUNT(*) c FROM update2 u2
) t
{code}

> Implement query rewrite based on sharding configuration
> -------------------------------------------------------
>
>                 Key: CALCITE-2141
>                 URL: https://issues.apache.org/jira/browse/CALCITE-2141
>             Project: Calcite
>          Issue Type: New Feature
>            Reporter: Christian Beikov
>            Priority: Minor
>
> Based on topology changes, it should be possible to dynamically update a 
> sharding configuration for Calcite. The effect of such a configuration is 
> that a query involving sharded tables is rewritten to a more optimal form, 
> possibly targeting multiple different data sources.
> This is an interesting building block for distributed databases but also for 
> applications since it enables the implementation of a static sharding scheme. 
> Doing the shard rewriting on a client is also a lot better when using a 
> distributed database as that eliminates the need for a coordinator node 
> through which data is tunneled.
> Also see [https://github.com/shardingjdbc/sharding-jdbc] for an existing 
> implementation.
> So imagine a topology with a master node and 2 worker nodes, one having 
> shards with even numbers and the other one having shards with odd numbers. 
> Table "A" is sharded by the column "tenant_id" into e.g. 30 shards. So the 
> sharding configuration for table "A" would contain the information "worker 1 
> has shards 1,3,5,..." and "worker 2 has shards 0,2,4,...". It also specifies 
> that the sharding strategy should use a hash function {{shard = tenant_id % 
> 30}}.
> When an application sends a query like e.g. {{select * from A where tenant_id 
> = 1}}, normally the master/coordinator would do the shard rewriting, but 
> doing this already at the client can eliminate the master as a bottleneck in 
> many scenarios. It is clear that the query can be completely fulfilled by 
> worker 1, since it owns shard 1, so the rewriting simply pushes the query to 
> that worker. Note that there might be cases where a shard is replicated to 
> other workers, so it might be beneficial to make it configurable whether or 
> when replicas should be used for querying.
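> For the example above, the pushed-down form would look something like this 
> (worker1.A denoting table A on worker 1):
> {code:sql}
> -- tenant_id = 1  =>  shard = MOD(1, 30) = 1  =>  owned by worker 1
> SELECT * FROM worker1.A WHERE tenant_id = 1
> {code}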
> A query like {{select * from A where tenant_id in (1, 2)}} could be 
> transformed to {{select * from worker1.A where tenant_id = 1 union all select 
> * from worker2.A where tenant_id = 2}}. One optimization could be to target a 
> single worker if it contains at least a replica of all required shards, but 
> that would need to be configurable again since replicas might lag behind.
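> For example, if worker 1 also held a replica of shard 2, the single-worker 
> form could be something like:
> {code:sql}
> -- Shard 1 is owned by worker 1; shard 2 is assumed to be replicated
> -- there, so one worker can serve the whole IN list.
> SELECT * FROM worker1.A WHERE tenant_id IN (1, 2)
> {code}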
>  
> DML statements should obviously be handled as well, but at first I would 
> simply forbid the use of multiple workers within one transaction. Supporting 
> multiple workers in a transaction will essentially require a 2PC, and I'm not 
> sure it's always a good idea to let an application be the coordinator for 
> such a transaction. There should definitely be an option to let a 
> master/coordinator node of a distributed database handle the details of the 
> 2PC, by configuring that DML statements should always be pushed to the 
> master/coordinator node.
> The sharding-jdbc implementation only offers _BASE_ guarantees. I guess there 
> are cases where this makes sense, so making the transaction handling pluggable 
> to allow other strategies would definitely be nice as well.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
