Re: How does Django handle auto-increment PK when a model is sharded horizontally into multiple databases?

Russell Keith-Magee Mon, 23 Aug 2010 04:48:29 -0700

On Mon, Aug 23, 2010 at 7:32 PM, Andy <selforgani...@gmail.com> wrote:
> On Aug 20, 10:04 pm, Russell Keith-Magee <russ...@keith-magee.com>
> wrote:
>
>>Of course, given that you know your sharding scheme, you could use the
>>router directly.
>>
>>Tweet.objects.using(router.db_for_read(Tweet, author=a)).filter(author_id=a)
>
> Ah Thanks. This is what I need.
>
>
>> You won't get any argument from me. What's missing is a clear
>> suggestion on how we can encompass this problem in the general case.
>> Suggestions are welcome.
>
> One suggestion I have is that any arguments in filter() should also be
> passed as part of the hints dictionary to the database router. The
> arguments in filter() should be enough information to determine which
> shard to select in most cases.
>
> So in the above example:
>
> Tweet.objects.filter(author_id=a)
>
> The keyword:value pair {'author_id': a} should be added to the hints
> that got passed to the database router.
>
> This achieves basically the same results as doing
> Tweet.objects.using(router.db_for_read(Tweet,
> author_id=a)).filter(author_id=a)
> But it doesn't require going through the entire codebase and modifying
> every single queryset so it's less prone to error and more DRY.


Ok, so how are the following queries processed?

Tweet.objects.filter(author_id=a, other=b)

Tweet.objects.filter(author_id=a).filter(other=b)

Tweet.objects.filter(author_id=a).exclude(other=b)

Like I keep saying - we've given this some thought, and it's easy to
solve this for the simple case. It's the general case that poses a
problem.

This also steps around the fact that we don't actually store the
contents of filter() clauses once they're applied; they're converted
into query-specific representations as soon as they're added to a
queryset.
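For the simple case, the explicit-router call quoted at the top of this message can be wrapped in one helper so that call sites don't each repeat it. What follows is only a sketch under assumptions: the shard aliases, the modulo-on-author_id scheme, the "author_id" hint name and the by_author() helper are all hypothetical and not part of Django. The router itself is a plain class, since Django only requires the db_for_read/db_for_write methods.

```python
SHARDS = ["shard1", "shard2", "shard3"]  # hypothetical database aliases


class AuthorShardRouter:
    """Toy router that routes on an explicit author_id hint (assumed convention)."""

    def db_for_read(self, model, **hints):
        author_id = hints.get("author_id")
        if author_id is None:
            return None  # no opinion: let other routers / the default decide
        # Assumed sharding scheme: author_id modulo the number of shards.
        return SHARDS[author_id % len(SHARDS)]

    def db_for_write(self, model, **hints):
        return self.db_for_read(model, **hints)


def by_author(manager, router, author_id):
    """Hypothetical helper: one place that pairs using() with the filter."""
    alias = router.db_for_read(manager.model, author_id=author_id)
    return manager.using(alias).filter(author_id=author_id)
```

With a real model this would be called as by_author(Tweet.objects, AuthorShardRouter(), a). It only covers queries that are known to carry the shard key; it does nothing for the chained filter()/exclude() cases above.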

> I have another use case where I want to shard by primary key, which is
> an auto-increment. I have this model:
>
> class Auction(models.Model):
>    seller_id = models.IntegerField()
>    text = models.TextField()
>    price = models.DecimalField(max_digits=10, decimal_places=2)  # both args are required; values illustrative
>
> The PK of Auction is the auto-increment "id" field.
>
> Say I divide Auction into 3 shards and set up each shard so that the
> auto-increment id's don't collide.
>
> When I first create and save a new auction, it doesn't have an "id",
> so I just want to randomly pick a shard to save to:
>
>    def db_for_write(self, model, **hints):
>        if model.__name__ == "Auction" and 'instance' not in hints:
>            return random.choice(['shard1','shard2', 'shard3'])
>
> Would the above work?

Well, it will certainly work in the sense that you will write the
instance to a random shard. The issue is whether you will be able to
reliably retrieve the objects afterwards. That comes down to exactly
how your auto-increment scheme allocates primary keys. If you can
guarantee that the primary keys are allocated in a programmatic way,
so that the shard can be derived from the key, then it will probably
work.
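As a sketch of what "programmatic" could mean here: assume each shard only hands out ids from its own residue class, so shard1 gets 1, 4, 7, ..., shard2 gets 2, 5, 8, ..., and shard3 gets 3, 6, 9, ... (MySQL's auto_increment_increment and auto_increment_offset settings are one way to arrange that). The allocation scheme and the shard_for_pk() helper are assumptions, not something Django provides. Note also that, as far as I can tell, save() passes the instance hint even for unsaved objects, so this sketch checks whether the pk is set rather than whether 'instance' is present.

```python
import random

SHARDS = ["shard1", "shard2", "shard3"]  # aliases from the question


def shard_for_pk(pk):
    """Map a primary key to its home shard.

    Assumes shard number k (0-based) only allocates ids where
    (id - 1) % 3 == k. This is an assumed allocation scheme.
    """
    return SHARDS[(pk - 1) % len(SHARDS)]


class AuctionRouter:
    """Router sketch for the Auction model only."""

    def db_for_write(self, model, **hints):
        if model.__name__ != "Auction":
            return None  # no opinion about other models
        instance = hints.get("instance")
        pk = getattr(instance, "pk", None)
        if pk is None:
            # New, unsaved object: any shard can allocate its id.
            return random.choice(SHARDS)
        # Existing object: it must be written back to its home shard.
        return shard_for_pk(pk)
```

Reads by primary key can reuse shard_for_pk() with an explicit using() call; reads by any other field would still have to consult every shard.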

> Also how random is random - would I get a uniform distribution of
> records among the shards?

Depending on your level of mathematical rigor, that's not a simple
question. To a simple approximation, yes, you'll get a uniform
distribution. However, the patterns and underlying distributions of
random number generators are the subject of continued research and
improvement.
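If you want to see the "simple approximation" for yourself, a quick simulation is enough. This is a sketch only; the simulate() helper and the sample size are arbitrary choices, and it says nothing about the finer statistical properties of the generator.

```python
import random
from collections import Counter

SHARDS = ["shard1", "shard2", "shard3"]  # aliases from the question


def simulate(n, seed=None):
    """Pick a shard n times with random.choice and count the picks."""
    rng = random.Random(seed)  # private generator so a seed makes runs repeatable
    return Counter(rng.choice(SHARDS) for _ in range(n))


if __name__ == "__main__":
    n = 30000
    counts = simulate(n, seed=0)
    for shard in SHARDS:
        # Each share should come out close to one third.
        print(shard, counts[shard], round(counts[shard] / n, 3))
```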

Yours,
Russ Magee %-)
