> For creating entries, something similar for _insert_dataset should
> work, assuming you know the id before saving.

I think this is a key piece that's often overlooked in sharding, and
I'm glad you pointed it out.  One solution: if you use UUIDs for
primary keys, the edge hosts can generate them, and they can be hashed
before all operations - including inserts - so you know which shard a
record winds up on.  The downside of UUIDs is the extra storage
compared to an int.
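The hash-to-shard step can be sketched in plain Ruby; the shard names
and the shard_for helper are made up for illustration:

```ruby
require "securerandom"
require "zlib"

SHARDS = [:shard_a, :shard_b, :shard_c, :shard_d]

# Hypothetical helper: map a UUID primary key to a shard by hashing it.
# CRC32 is stable across processes, unlike Ruby's String#hash.
def shard_for(uuid)
  SHARDS[Zlib.crc32(uuid) % SHARDS.length]
end

uuid = SecureRandom.uuid
shard_for(uuid)  # same answer for the insert and every later read
```

Since the edge host generates the UUID before the insert, it can route
the insert and all later lookups with the same function.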

> For queries that need access multiple databases, you'd obviously have
> to use separate queries.  I'd probably do separate queries for each
> table and server, and combine the results together:
>
>   rows = []
>   tables = [:table_a, :table_b, :table_c, :table_d]
>   shards = [:shard_a, :shard_b, :shard_c, :shard_d]
>   tables.each do |t|
>     shards.each do |s|
>       rows.concat DB[t].server(s).all
>     end
>   end
>   rows

Good description.  As you said, the problem I have is basically
turning this:

   select * from creations where id in (1,2,3,4,5,6)

Into:

   server1:
     select * from creations where id in (1,3,5)
   server2:
     select * from creations where id in (2,4,6)

And then collating.  In the simple case, your solution works well.
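The split itself can be sketched in plain Ruby, assuming a
hypothetical shard_for_id function (here: odd ids on server1, even on
server2, purely for illustration):

```ruby
# Hypothetical shard-picking function, made up for this example.
def shard_for_id(id)
  id.odd? ? :server1 : :server2
end

ids = [1, 2, 3, 4, 5, 6]
by_shard = ids.group_by { |id| shard_for_id(id) }
# => {:server1=>[1, 3, 5], :server2=>[2, 4, 6]}

# Then one IN query per shard (Sequel-style, not run here):
# by_shard.each do |shard, shard_ids|
#   rows.concat DB[:creations].server(shard).where(id: shard_ids).all
# end
```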

Trying to do it with something like this gets a bit uglier:

   select * from creations where score > 100 order by score desc

You can't just concat the two because the individual per-shard sorts
may interleave, so you have to do the sorting in the app layer.  So:

   rows = []
   [:server1, :server2].each do |s|
     rows += DB[:creations].server(s).where { score > 100 }.all
   end

Then (the rows come back as hashes, so index by key):

   results = rows.sort_by { |r| -r[:score] }

But these problems can turn intractable quickly - think pagination, or
finding only the "top 10" results.  You end up retrieving tons of
excess data and throwing it away (e.g. fetch the top 10 from every
shard, then re-sort to find the real top 10).  In most cases it's
probably better to spend the time tuning the DB and saving up for a
larger server.  :-)
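The top-10 workaround can be sketched like this, with each shard's
result hard-coded to stand in for an "order by score desc limit N"
query:

```ruby
# Pretend each shard already ran its own LIMIT query.
per_shard = {
  server1: [{ id: 1, score: 90 }, { id: 3, score: 70 }],
  server2: [{ id: 2, score: 95 }, { id: 4, score: 60 }],
}

# Concat all candidates, re-sort, keep the real top N.  Note we fetched
# N rows *per shard* but only keep N total -- the rest is wasted work.
top = per_shard.values.flatten.sort_by { |r| -r[:score] }.first(2)
# => [{:id=>2, :score=>95}, {:id=>1, :score=>90}]
```

This is the "retrieve and throw away" cost: with S shards you fetch up
to S*N rows to produce N.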

I've also thought about an approach where you build indexes somewhere
other than your data store, then use after_save hooks to keep them up
to date.  For example, you could use Redis to keep extremely fast
indexes, which could contain the server and table the record is in.
So:

   1. Perform a sort operation on a Redis data structure
   2. Read those results which give [server, table, id]
   3. Submit select by ID queries to the appropriate DB server(s)
   4. Collate / sort in the Ruby layer
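The steps above can be roughed out with a plain Ruby hash standing in
for the Redis sorted set (the real thing would use ZADD/ZREVRANGE via
the redis gem; the index entries and shard names here are made up):

```ruby
# Stand-in for a Redis sorted set: member => score, where each member
# encodes [server, table, id] as "server:table:id".
index = {
  "server1:creations:1" => 90,
  "server2:creations:2" => 95,
  "server1:creations:3" => 70,
}

# Steps 1-2: sort on the index, read back [server, table, id] triples.
members = index.sort_by { |_, score| -score }.map(&:first)
triples = members.map { |m| m.split(":") }

# Step 3: group by server, so you can issue one select-by-id per shard.
by_server = triples.group_by(&:first)

# Step 4 would collate the DB results back into the index's sort order.
```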

The downside is you have to make many, many network calls, which can
get really expensive.  You could also do the reverse: store your data
as a JSON structure in Redis/Tokyo Cabinet and use MySQL for the
indexes.  That way you'd have a richer set of math ops - sum(),
max(), etc. - but you'd have even more network fetches from Redis/etc.,
since you can only get one key at a time.

Anyways, I've wanted to prototype this to see how it works out but
haven't had a chance.  Guess this is going a bit off-topic for Sequel
at this point.  :-)

-Nate
