Andreas Korneliussen (JIRA) wrote:
[ http://issues.apache.org/jira/browse/DERBY-937?page=all ]
Andreas Korneliussen updated DERBY-937:
---------------------------------------
Attachment: DERBY-937.diff
DERBY-937.stat
I was able to reproduce this problem on every run on a very fast laptop (it was
not reproducible on any other Windows lab machine I tried). This laptop
happens to be the same kind as Ole uses for the nightly tests.
Adding a 10 second sleep after the population of the tables did not
have any effect. I therefore tried running a compress on the tables (based on
the assumption that statistics are updated on compress), and now the test
no longer fails for me.
Attached is the patch which makes this test stop failing. The patch does not
seem to have any side effects on the other platforms (Solaris) tested; however,
the test will take more time.
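For reference, the workaround in the patch amounts to compressing each
populated table with Derby's system procedure. The schema and table names
below are placeholders, not the test's actual tables:

```sql
-- Rebuild the table and its indexes in place; as a side effect this
-- recomputes the cardinality statistics used by the optimizer.
-- 'APP' and 'MYTABLE' are placeholder names.
CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE('APP', 'MYTABLE', 1);
```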
It is great that this fixes the problem in your reproducible case, so I say
we should commit. But I don't know why it works. To me it looks like
the loader program always first inserts all the data into the base
tables and then creates the indexes. In this case Derby automatically
creates the "statistics" that you mention here, and they should be
no different than what is recreated when you do the compress. Also
derby will create "packed" indexes in this case which again should look
no different than what compress will do (since all that has happened
are inserts into the table - if deletes or updates were involved it
would be a different story).
What the change does do is significantly alter the cache contents and the
I/O timing.
Also, I wanted to make sure again that people understand what statistics
mean in derby.
What is usually called "statistics" in other db's comes in 3 forms
in derby:
1) key distribution information
o this is automatically maintained by derby by using existing
indexes, no user interaction is ever needed and it is always up
to date. Basically we use the index itself to estimate a given
number of rows for a range of keys. Most other db's I know require
some sort of "update statistics" to get this info.
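To make item 1 concrete, here is a rough sketch (illustrative Python, not
Derby code - Derby's actual implementation walks btree pages in Java) of how
an ordered index can estimate the number of rows in a key range without any
separately maintained statistics:

```python
from bisect import bisect_left, bisect_right

def estimate_rows_in_range(sorted_keys, low, high):
    """Estimate how many rows have low <= key <= high by probing the
    ordered index itself: the distance between the two probe positions
    is the row count for the range."""
    return bisect_right(sorted_keys, high) - bisect_left(sorted_keys, low)

index = [1, 3, 3, 5, 7, 9, 9, 9, 12]
print(estimate_rows_in_range(index, 3, 9))  # 7 rows with keys in [3, 9]
```

Because the estimate comes straight from the index structure, it is always as
current as the index itself, which is the point being made above.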
2) base table row counts
o also automatically maintained by derby, but can be slightly out
of date as it is only an estimate for performance reasons. Row
counts are updated at insert/delete time on a per page basis
automatically, but we delay the rollup. The rollup is always done
when the page goes to disk. The rollup may also be done earlier if
the change is significant relative to the total - I think if the
delta is 10% or more of the table. The row count estimate is also
updated if the system ever happens to do a full scan of the
table.
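Item 2 (delayed rollup of per-page row counts) can be sketched like this.
The class and method names are my own, and the 10% threshold is the value
guessed at above, not a verified constant:

```python
class TableRowCount:
    """Toy model of a delayed row-count rollup: per-page deltas are
    folded into the table-level estimate only when a page is flushed,
    or earlier when the accumulated change reaches 10% of the estimate."""

    def __init__(self, initial_estimate):
        self.estimate = initial_estimate
        self.page_deltas = {}          # page id -> not-yet-rolled-up delta

    def record(self, page, delta):
        self.page_deltas[page] = self.page_deltas.get(page, 0) + delta
        pending = sum(abs(d) for d in self.page_deltas.values())
        if self.estimate and pending >= 0.10 * self.estimate:
            self.rollup_all()          # change is significant: roll up early

    def flush_page(self, page):
        # the rollup always happens when the page goes to disk
        self.estimate += self.page_deltas.pop(page, 0)

    def rollup_all(self):
        for page in list(self.page_deltas):
            self.flush_page(page)

t = TableRowCount(1000)
t.record(7, +20)       # below the 10% threshold: estimate still 1000
t.flush_page(7)        # page written to disk: delta rolled up
print(t.estimate)      # 1020
```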
3) cardinality statistic
This is basically one number per key prefix: the average number of
duplicates for a given key, where a key is a leading set of columns
in an index. So a single-column index has one number, and an index
with 3 columns (a, b, c) has 3 numbers indicating the cardinality of
(a), (a, b) and (a, b, c).
o This statistic is automatically created when an index is created,
and also when a number of "bait/switch" ddl operations are done like
compress table. There have been a number of discussions on the list
on how to automate the update of this. My opinion is that we should
attack it in the following 3 ways:
o provide a way for users to schedule it to be calculated, other than by
running compress. Not great for zero-admin - just a workaround
until we can better automate it. The code exists; it is just not
exposed to users right now.
o Develop a background "zero-admin" component which would
programmatically figure out how and when to schedule this kind of
activity.
o See if there is any smart, quick way to estimate the values
automatically by doing some statistical analysis of the btree
leaf pages.
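Finally, a sketch of what the item-3 cardinality numbers measure (again
illustrative Python, not Derby code): for each leading prefix of the index
columns, the statistic is the average number of rows per distinct prefix
value:

```python
from collections import Counter

def cardinality_stats(rows):
    """For index rows that are tuples over columns (a, b, c, ...),
    return, for each leading prefix length, the average number of
    duplicate rows per distinct prefix value."""
    stats = []
    for n in range(1, len(rows[0]) + 1):
        counts = Counter(row[:n] for row in rows)
        stats.append(len(rows) / len(counts))  # avg duplicates per key
    return stats

# index on (a, b): 6 rows, 2 distinct values of (a), 3 distinct of (a, b)
rows = [(1, 'x'), (1, 'x'), (1, 'y'), (2, 'x'), (2, 'x'), (2, 'x')]
print(cardinality_stats(rows))   # [3.0, 2.0]
```

Computing these numbers requires looking at the whole index, which is why
they go stale between index creation (or compress) and the next rebuild,
unlike the self-maintaining statistics in items 1 and 2.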