Anthony W. Youngman wrote:

In article <[EMAIL PROTECTED]>, Lauri Pietarinen
<[EMAIL PROTECTED]> writes


Anthony W. Youngman wrote:



Fine. But MV *doesn't* *need* much of a cache. Let's assume both SQL and
MV have the same amount of RAM to cache in - i.e. *not* *much*. I did
say the spec said "extract maximum performance from the hardware
available".



So what's wrong with getting a machine with lots of memory? How much does 2G of
memory for an Intel box cost nowadays? Is this some kind of new extreme sport, trying
to get along with as little memory as possible?



I presume you didn't read the bit below ... what if you have SEVERAL tables, and EACH of them is a gigabyte or two in size?

OK, I get your point.

Well, if it is normalised, how easy is it for you to change the customer_id of an order?



Incredibly easy. Just update the "customer_id" field of the invoice record. A single change to a single "row".

And I presume the system will automatically move all related stuff (order details etc.) into
the same block as the new customer? How long will that take? What if there is no room for it there?
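For comparison, in a normalised schema the same change really is a single-row write, with nothing to relocate; a minimal sketch (table and column names here are my assumptions, not from the thread):

    -- Reassign order 1001 to customer 42. Only this one row changes;
    -- order_detail rows reference order_id, not customer_id, so nothing
    -- else needs to be touched or physically moved.
    UPDATE orders
       SET customer_id = 42
     WHERE order_id = 1001;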


Anyway, if we stick to your example, and even if we don't normalise using e.g. the clustering
features of Oracle, then, as Bob pointed out, we are getting at most the same number of I/Os.
So, the answer to your question is: our formula is at least as good as yours.



Except I think Bob said we could "optimise to favour *certain* transactions". I think actually ANY transaction benefits. You're relying on stuff that's outwith your theory; we're relying on stuff that's inherent to our model.

That certainly is not true. The theory says NOTHING about how data should be arranged on disk.
You are talking about how modern SQL databases behave. The DBMS is at liberty to do whatever
it pleases with the data, even save it in a PICK database. Hey, whaddaya think? Would that be
a good idea? We get to keep our SQL but with the speed of PICK ;-)
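Oracle's table clusters, for instance, do exactly this kind of physical co-location without touching the logical schema; a rough sketch (names and the SIZE value are my assumptions):

    -- Store an order and its detail rows in the same blocks, keyed on
    -- order_id. This is purely a storage decision: the tables, queries
    -- and constraints are unchanged.
    CREATE CLUSTER order_cluster (order_id NUMBER) SIZE 8192;
    CREATE INDEX order_cluster_idx ON CLUSTER order_cluster;

    CREATE TABLE orders (
        order_id     NUMBER PRIMARY KEY,
        customer_id  NUMBER
    ) CLUSTER order_cluster (order_id);

    CREATE TABLE order_detail (
        order_id     NUMBER,
        product_id   NUMBER,
        qty          NUMBER,
        unit_price   NUMBER
    ) CLUSTER order_cluster (order_id);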


Now, that was a *conservative* estimate, and we assumed that we did not have
any rows lying around in the (global!) cache. As the size of the cache grows in
proportion to the size of the total database, we can assume less and less disk I/O.




You're relying on the hardware to bale you out :-) We can do the same!



Well why don't you?



We let the hardware help us out if it can. There's a big difference. If you can't get the hardware, you're stuffed. We don't need it, so while we may have a hard time of it, it's nowhere near as bad for us.

And again, relational separates the physical from the logical. You're
being hypocritical if you call upon the physical representation to help
out with the (speed of the) logical presentation.

My goodness, no I'm not! It's the same as claiming that, if you have a drawing of a house, you
have to make the house out of paper!


I want a list of all products with their corresponding total sales, read
from order detail, e.g.

Hammer   $10,000
Nail      $5,000
Screw     $1,200

How many disk reads (or head movements)?



Actually, probably the same as you here, if we're indexed on order
detail. If Hammer appears in N invoices, then T = (1+N) * ST * 1.05 for
hammers, and the same for all the other products.

Theory favours us, in that if a product appears X times in one invoice,
that's one read for us and X for you; but hardware will probably help
you more than us (that is, assuming thrashing cuts in), in that you stand
a marginally higher chance of getting multiple instances of a product in
any given read.

So for each product you get T = (1+N) * ST * 1.05.
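To put rough numbers on that (assuming here that ST is the cost of one random disk read, around 10 ms on disks of that era): a product appearing on N = 1,000 invoices costs (1 + 1000) * 10 ms * 1.05, roughly 10.5 seconds of head movement, and that is for ONE product.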

Now, for our SQL-DBMS, presuming that we build indexes for detail and product:

order_detail(product_id, qty, unit_price)  = 20 bytes/row
product(product_id, product_name) = 50 bytes/row

With two disk reads (one 8K page from each) I would get
8K/20 = 400 order detail rows and
8K/50 = 160 product rows.

Since all rows are in product_id order, there is no need for random disk reads, so
T = 1 + N/400 + P/160   (N = number of details, P = number of products)
for ALL products and details.

And, because of sequential prefetch, we probably would not have to wait
for I/Os at all.

Really, however you calculate it, it is an order of magnitude less
than your alternative.

And please don't tell me that using indexes is not fair or not in the spirit of the
relational model ;-)
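For concreteness, the query we are both costing is just a join plus an aggregate (using the table definitions above):

    -- Total sales per product. With both tables indexed in product_id
    -- order, the optimizer can scan each index sequentially and
    -- merge-join them; no random I/O is needed.
    SELECT p.product_name,
           SUM(d.qty * d.unit_price) AS total_sales
      FROM order_detail d
      JOIN product p ON p.product_id = d.product_id
     GROUP BY p.product_name;

Plugging in, say, N = 100,000 details and P = 1,000 products gives T = 1 + 100000/400 + 1000/160, about 257 sequential page reads, against roughly 1.05 * (N + P), about 106,000 random reads, for the record-at-a-time formula above.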


And: what if I was just reading customer data? Would the same formula
apply (= (2+N)*ST*1.05)?




Nope. If I understand you correctly, you want attributes that belong to
the entity "customer", not the entity "invoice". T = ST * 1.05. (By the
way, billing and/or invoice address (for example) are invoice
attributes, not company attributes.)



No, I want you to give me a list of all your customers. How many disk reads?



T = N * 1.05 where N is the number of customers. What do you want to know about those customers? Address? Phone number*s*? Anything else? That's *all* at no extra cost.

Well, no thanks. I just wanted their names this time.
The relational alternative, with an index on customer_name, would again need an order
of magnitude fewer disk reads.
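The point is that an index on customer_name alone can answer the query, so only the narrow index pages are read, never the wide customer rows; a sketch (names are my assumptions):

    -- The index "covers" this query: scanning its leaf pages yields
    -- every name without touching the customer table at all.
    CREATE INDEX customer_name_idx ON customer (customer_name);

    SELECT customer_name
      FROM customer
     ORDER BY customer_name;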


But as I understand relational theory, such a question is completely
outside the scope of the theory, seeing as it tries to divorce the
database logic from the practical implementation ...





The theory, indeed, does not say anything about buffer pools, but by decoupling logic
from implementation we leave the implementor (the DBMS) free to do as it sees fit.
As DBMS technology advances, we get faster systems without having to change our programs.




But with MV, if our database is too large for current technology, we
kick the shit out of relational for speed ...

What is "too large"?

Don't forget. You've already said that, if nothing is cached, my average
case exceeds your best. And my case is *already* assuming that the
system is seriously stressed and struggling ...

It does?

When we design databases we can decouple logical planning from performance
considerations, which, you must agree, are two separate issues.



Yes. BUT what's the point of having a database that is logically
perfect, and whose performance is slow to the point of being unusable?

Don't forget - in practice MultiValue ends up with a database that is
*inherently* optimised such that it almost invariably outperforms an
equivalent SQL database, AND we don't normally have DBAs to help us
achieve that nirvana ...

Frankly, it may well be that PICK systems run faster and cheaper than relational ones, but certainly
not for the reasons you state.





I can't find the post now :-( but is Christopher reading this? You know
I compared that relational system on a twin Xeon 800 to an MV system
running on a P90? Christopher made the (reasonable in the circumstances)
assumption that the relational consultants must be crap, and the MV guy
a guru. Actually, I'd come to exactly the OPPOSITE conclusion. My MV
experience tells me that the MV query was probably thrown together, by an
average programmer, in 30 seconds. On the other hand, those SQL
consultants had an axe to grind and a point to prove. They couldn't
afford to let this "old fashioned" system beat them. That SQL query
would have been optimised to within an inch of its life over weeks.
Don't forget how proud they were to beat this MV system! Yet with
hardware that was so much more powerful and a query that was heavily
optimised, they had great difficulty beating a query that was thrown
together in seconds by an average MV guy (or even just a luser!).

Don't forget. I said I am a database *engineer*. Engineers believe in
elegance, they believe in beauty. And when I look at relational, all I
see is the theorists pleading "power", "hardware", "brute force", to get
them out of trouble.




You said that logical planning and performance are separate issues. And
I wouldn't expect you to address the above example in a discussion of
relational, because performance is irrelevant to relational.

I would have to know a lot more details to address it properly. Performance is irrelevant to the model.
It's like E=mc**2. Nice theory, and it actually works. But to get performance out of it
(= an exploding bomb) you have to solve lots of practical details. However, without the theory
you could experiment for a million years without being able to build an atom bomb.


But surely, the fact that I am SUPREMELY CONFIDENT that I can get
superior performance from inferior hardware should give you pause for
thought that maybe, just maybe, the relational model is flawed from an
engineer's or scientist's viewpoint?

That's OK with me. But the most you can claim is that today's IMPLEMENTATIONS are flawed,
and you would be 100% correct. How would you go about proving that the model is flawed?
You would have to prove that a relational DBMS could not POSSIBLY be efficient.


From the mathematician's (or logician's) viewpoint I agree it's
flawless. But that's true of plenty of broken scientific theories...

Could you give me some other examples?

best regards,
Lauri Pietarinen




Cheers,
Wol



