[Rails] Re: Best way to handle multiple tables to replace one big ta

Mike P. Sat, 24 Apr 2010 16:45:57 -0700

Marnen Laibow-Koser wrote:
> Mike P. wrote:
>> Marnen Laibow-Koser wrote:
>>> Mike P. wrote:
>>>> Hello,
>>>> 
>>>> I've decided to use multiple tables for an entity (e.g. "todo_items1,"
>>>> "todo_items2," "todo_items3," etc.), instead of just one main table
>>>> which could end up having a lot of rows (e.g. just "todo_items"). I'm
>>>> doing this to try and to avoid a potential future performance drop that
>>>> could come with having too many rows in one table.
>>> 
>>> Robert already said it, and I agree: DO NOT DO THIS.  Any decent DB 
>>> system is designed to handle tables in the millions of rows without 
>>> breathing hard.  Just index your tables properly and make sure your 
>>> queries are efficient.
>>> 
>>> If you have to use your DB's sharding features at some point, go for it. 
>>> But don't even think of doing this in the app layer.
>>> 
>>> Best,
>>> --
>>> Marnen Laibow-Koser
>>> http://www.marnen.org
>>> mar...@marnen.org


Hi Marnen,

Thanks again for your response. I was thinking about it last night, and 
did some more research and a few calculations, and I'm inclined to agree 
with you and stick with the single table model. It will make things 
easier and the database should be able to handle it for a while.

However, I'm going to reply below as if I'm sticking with the 
multi-table model, mostly to see how I really feel about it by typing 
some of my thoughts (it sounds kind of weird, but hopefully it makes 
sense).

Honestly, I still think that splitting the main table into smaller ones 
will have future benefits. Even though there's a bit of extra work up 
front, I'd rather get this small thing over with rather than letting it 
turn into a bigger thing, sooner, in the future.

I think if people could just get over the "don't optimize too early" 
mantra, and realize that this can't possibly be best move for everyone, 
a lot of future stress could be avoided, for both the business owner and 
the customer.

Personally, I'd probably change that saying to read "don't optimize *too 
much* too early." As they say, an "ounce of prevention is worth a pound 
of cure..." it makes sense in the database optimization world too. :)

This multi-table thing isn't meant to completely avoid the need to do 
further optimizations, by the way, and I'm aware that I shouldn't go 
overboard with too much upfront optimization. I know that future, most 
likely database-layer, performance tweaks will probably be needed later. 
But it's *because* I know this that I was thinking about an easy-to-do 
change that I could do now, that will keep the performance better, for 
longer.

Anyway...

>> 
>> Hi Marnen,
>> 
>> Thank you also for your post and advice. It actually made me think more 
>> about this. So you're saying that even if I had millions of records in a 
>> table, and was noticing a decrease in performance, that I shouldn't 
>> split up the tables this way?
> 
> Correct, unless there is no better way to improve things (which will 
> almost certainly not be the case).
> 
>> Would a shard be the only option?
> 
> Depends on the DB.
> 

It's PostgreSQL running on a single database server, in case you were 
wondering.

>> 
> 
> 
>> One of the things that I keep coming back to when I think about the 
>> single-table option is that one of the fields will be rather large, like 
>> a blog-post in size. And I plan on indexing that to make it searchable. 
>> Also, there's a lot of insertions and deletions. Does it still make 
>> sense to include all of that in a single table? Or does that change any 
>> of your advice?
> 
> That does not change any of my advice -- right up until the point where 
> you have actual, measurable problems.  And by then, your data model may 
> be different anyway.  Don't prematurely optimize, because you may be 
> optimzing the wrong things.

I see your point here. I don't think the model for this table will 
change much though (I know, I know), but I do see your point. However, 
from the other side, even if the model does change, I could always just 
update all of those tables with the migration(s), which I'd still have 
to do with the one-big-table model.

> 
>> 
>> I agree that this is an application layer split, sort of; but I don't 
>> think that this particular split is a bad thing.
>> 
> 
> On what basis do you think that?  (I think you're wrong, but I want to 
> know your reasoning.)

Well, pretty much because it's not much different than accessing the 
rest of the tables in the database.

When I want to look up a user, I go to the User table; I may have 
another table for user preferences which I'd also have to access. 
Looking at todo items for a user involves me looking at a single table 
in both cases, the main difference between the multiple-tables vs 
single-tables is that there's an extra bit of indirection there.

It could be as simple as:

1) Looking at the "which_todo_table" column (which would be a number, 
I'm thinking)
2) Appending that number to the end of a string (so, "todo_items" << "3" 
becomes "todo_items3")

So, there's an extra column in the User table that tells which table to 
look at to get that user's todo items, and the app layer chooses that 
database when it needs to.

I agree that the database can handle this sort of split more efficiently 
than the application layer, especially when it comes to things like 
sharding or partitioning; but I'm looking at the multi-table model as a 
sort of "soft-partitioning." The data gets "split" up in the database 
(via tables), and the indexes stay at about a fifth of the size of the 
single-table methodology. Similar to partitioning, but without modifying 
the database in a way that Rails doesn't know about.

So, the difference is that I'd be using Ruby/Rails to "decide" which 
table to look at, instead of having a pre-set constant that's stored by 
Rails and used whenever a model is accessed.

With this, as a benefit, everything's at the application layer. I can 
see exactly what's going on, and if I move servers or something, I won't 
have to deal with the potentially complicated, database-specific 
sharding stuff that can only be seen by investigating the database 
layer.

It's just another table. That's how I'm looking at it.

> 
>> If I did this in pure Ruby, for example, I'm guessing that the effect 
>> would be minimal (just one class/method that looks up the correct table, 
>> and returns the object for that table). 
> 
> But it is still inappropriate, because this can be handled more 
> efficiently on the DB side than on the app side.

I agree, but it's also more complicated to have to set up the database 
to manage all of that stuff.  Handling things on the DB side adds 
another "moving part," so to speak. It will have to be managed, tested, 
and revisited when changes are being made.

Don't get me wrong, I'm not opposed to DB-layer tweaks and 
modifications, but if I can avoid/delay complicated DB-layer changes by 
doing some work upfront - especially in this case where the modification 
isn't database-level --, then I'll at least look into doing so.

The other main benefit is that DB performance will be much more 
resilient to surges of data. I'll have more time to implement those 
DB-level changes, if/when needed, and the decrease in performance will 
be much more gradual.

Considering that making a change such as this has a such a 
(theoretically) small impact on the app-layer codebase, has no effect on 
the database itself, and allows the database/app to perform better for 
longer - thereby delaying the slowdowns and the need for potentially 
complicated, not-easily-portable changes - isn't it worth looking into?

You must have seen people who are noticing their database performance 
dropping, and who are scrambling to fix their issues. They pop up in 
forums quite often, no? And how do they fix it? With more servers and/or 
complicated procedures; with data migrations, extra layers, and then 
downtime. All the while, they're feeling pressure because the 
performance lag is noticeable, and getting more noticeable as time 
passes.

It happens so often, everyone knows that it's going to happen, and yet 
trying to optimize a bit upfront is discouraged.

And, of course, in every DB performance upgrade push, there's another, 
even more important part of the story here: the users.

For however long it takes to get these changes implemented, and 
sometimes from even before the changes start, the users are noticing 
these slowdowns, and possibly even time-outs. They're seeing notices on 
the site about how the company is working to make the site faster, blah, 
blah, blah. They don't look at it as "oh, yeah, they followed the 
hard-and-fast rule about not optimizing too early, there's nothing they 
could've done to prevent this, so it's okay, this is just what happens." 
Instead, it looks like they didn't plan this properly, or that they 
weren't prepared.

So then, my question is, why -- if people can expect these performance 
issues to happen as a site grows -- why is it so frowned upon to try and 
do a little bit of upfront work to delay that from happening... and also 
when it starts to happen, to have better performance during that period?

> 
>> What's (potentially) making the 
>> app layer bloated is all of the unused 'has_many' attributes since I'm 
>> trying to stick with Rails. I really do want to be able to take 
>> advantage of the nice Rails methods and functionality... I'm just trying 
>> to avoid having a bunch of unused associations for each user object.
>> 
> 
> Then don't try to use a silly schema like the one you proposed.  Perhaps 
> you will ultimately need something like MySQL's MERGE table, but you 
> don't need to worry about that yet.
> 

Not yet, true. But it'll come. So why not do something about it? In that 
case, do you have any other recommendations for something small that can 
be done upfront to keep performance up in the future? I've already got 
the indexes covered... anything else? Or should I seriously just do 
nothing but the indexing?

>> Is there maybe a way to unload a 'has_many' association for an object? 
>> Or perhaps call 'has_many' inside of a user-specific method or 
>> something? Like
>> 
>> if @current_user.todo_table_number == 1
>>   has_many :todo_items1
>> elsif @current_user.todo_table_number == 2
>>   has_many :todo_items2
>> elsif ...
>> 
> 
> Don't bother.
> 

Back to my question: why not? It's just a small preventative measure. 
Mind you, if I was attempting to do some multi-server sharding right 
now, then you'd be totally right.

>> 
>> Anyway, the reason why I'm trying to stay away from the database-layer 
>> stuff is because they seem to be a bit 'much' right now, 
> 
> Right!  They are a bit much right now -- because they are premature 
> optimization.  When you need them, they will be the right tools.
> 

But not all premature optimizations are the same. Would you tell someone 
not to worry about indexing until they get to a point where they have 
the metrics (i.e. a few hundred or thousand users/rows)? The database 
will work well until they get a few hundred/thousand rows without 
indexing, yet people are told to take care of that upfront.

This isn't a super-crazy thing I'm trying to do here, just some 
optimization that would help at the million+ mark, instead of at the 
thousand+ mark.

>> and also can 
>> apparently cause problems. For example, even adding foreign keys at the 
>> database-layer can evidently cause issues with testing, and sometimes 
>> they get lost in db:push/db:pull calls.
> 
> You apparently are passing on FUD you've heard somewhere.  I assure you 
> -- from experience -- that this is completely untrue.
> 

Well, it's not hearsay if that's what you're saying. I could have been 
reading something that doesn't apply to Rails 2.3.5 or Rails 3, but one 
of them was from someone using a specific host and, after doing a 
db:push or db:pull, the foreign keys that he setup were gone. The issue 
was due to a plugin that didn't support foreign keys.

As for the testing issue, it's something to do with the fixtures not 
being loaded in the proper order, and as a result, the Test database 
wasn't getting setup properly. That looked like it needed a patch or 
something to fix it, so maybe it wasn't a super big deal.

Besides that, as far as I know (and I could be wrong) Rails doesn't 
support foreign keys out of the box. That's why plugin's like Foreigner 
exist. I don't have the links for these offhand, but if you'd like them, 
I can go back and find them. I read about both of these within the past 
week.

But this is besides the point...

>> 
>> Okay, how about this. Let's say that instead of my original post, I 
>> posted saying that I have a large table with 7 million+ rows that keeps 
>> growing. I had already split up the tables in the manner mentioned in 
>> the original post. Also, let's say that I really wanted to stick with 
>> the-Rails-way and was opposed to sharding. What would your advice be 
>> then?
> 
> I would advise you to get over your anti-sharding bias and do things as 
> I outlined above.
> 

Well, it's not so much an anti-sharding bias, it's an attempt to keep 
the app/database cleaner for longer, to keep the app/database performing 
better for longer, and to avoid upsetting customers with decreased 
performance by giving myself some additional time to make any required 
drastic changes like sharding.

Sharding and other complex/time-consuming optimizations should indeed be 
left until the metrics show that it's going to be needed soon. I agree 
with that part. But sharding is also a few orders of magnitude more 
complex then what I'm trying to do.

I certainly don't agree that every optimization should be supported by 
data and held off until then, especially when we all know what happens 
when database performance goes down.

Waiting until a known potential problem (i.e. large table performance 
drops) is actually a problem has consequences. I'm just trying to delay 
it. Splitting up the tables will work, and it's not that complicated and 
keeps the supporting code at the application level.

So, any best practices or tips for how to support/implement this 
multiple table model would be great, and super appreciated.


>> 
>> Thanks again for your help,
>> Mike
> 
> Best,
> --
> Marnen Laibow-Koser
> http://www.marnen.org
> mar...@marnen.org

Thanks again,
Mike


-- 
Posted via http://www.ruby-forum.com/.

-- 
You received this message because you are subscribed to the Google Groups "Ruby 
on Rails: Talk" group.
To post to this group, send email to rubyonrails-t...@googlegroups.com.
To unsubscribe from this group, send email to 
rubyonrails-talk+unsubscr...@googlegroups.com.
For more options, visit this group at 
http://groups.google.com/group/rubyonrails-talk?hl=en.

[Rails] Re: Best way to handle multiple tables to replace one big ta

Reply via email to