Andreas Korneliussen (JIRA) wrote:
     [ http://issues.apache.org/jira/browse/DERBY-937?page=all ]

Andreas Korneliussen updated DERBY-937:
---------------------------------------

    Attachment: DERBY-937.diff
                DERBY-937.stat

I was able to reproduce this problem on every run on a very fast laptop
(it was not reproducible on any of the other Windows lab machines I have
tried).  This laptop happens to be the same kind as the one Ole uses for
the nightly tests.

Adding a 10-second sleep after the population of the tables did not have
any effect.  I therefore tried running a compress on the tables (based on
the assumption that statistics are updated on compress), and now the test
no longer fails for me.

Attached is the patch which makes this test stop failing.  The patch does
not seem to have any side effects on the other platforms (Solaris) tested;
however, the test will take more time.
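
For reference, here is roughly what the workaround amounts to when written
as a JDBC call.  This is just a sketch; the database, schema and table
names are placeholders, not the names the test actually uses:

import java.sql.CallableStatement;
import java.sql.Connection;
import java.sql.DriverManager;

public class CompressAfterLoad {
    public static void main(String[] args) throws Exception {
        // Embedded connection; "testdb" is a placeholder database name.
        Connection conn = DriverManager.getConnection("jdbc:derby:testdb");

        // SYSCS_COMPRESS_TABLE rebuilds the table and its indexes and,
        // as a side effect, recreates the cardinality statistics.
        // 'APP' and 'ORDERS' are placeholder schema/table names.
        CallableStatement cs = conn.prepareCall(
                "CALL SYSCS_UTIL.SYSCS_COMPRESS_TABLE(?, ?, ?)");
        cs.setString(1, "APP");
        cs.setString(2, "ORDERS");
        cs.setShort(3, (short) 1);   // 1 = sequential (uses less disk space)
        cs.execute();
        cs.close();
        conn.close();
    }
}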

It is great that this fixes the problem in your reproducible case; I say we
should commit.  But I don't know why it does.  To me it looks like the
loader program always first inserts all the data into the base tables and
then creates the indexes.  In this case Derby automatically creates the
"statistics" that you mention here, and they should be no different from
what is recreated when you do the compress.  Also, Derby will create
"packed" indexes in this case, which again should look no different from
what compress produces (since all that has happened are inserts into the
table - if deletes or updates were involved it would be a different story).
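
To make the ordering concrete, here is the pattern I believe the loader
follows, sketched with placeholder table and column names.  The point is
that the CREATE INDEX at the end is what writes the cardinality statistics,
since the table already contains rows:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.Statement;

public class LoadThenIndex {
    public static void main(String[] args) throws Exception {
        Connection conn =
                DriverManager.getConnection("jdbc:derby:testdb;create=true");
        Statement s = conn.createStatement();

        // Placeholder table; the real test uses its own schema.
        s.executeUpdate(
                "CREATE TABLE orders (id INT, customer INT, amount INT)");

        // Populate the base table first, as the loader does.
        PreparedStatement ins =
                conn.prepareStatement("INSERT INTO orders VALUES (?, ?, ?)");
        for (int i = 0; i < 1000; i++) {
            ins.setInt(1, i);
            ins.setInt(2, i % 50);
            ins.setInt(3, i % 7);
            ins.addBatch();
        }
        ins.executeBatch();

        // Creating the index afterwards builds a packed index and, since
        // the table is non-empty, also stores the cardinality statistics
        // for it - the same information compress would recreate.
        s.executeUpdate(
                "CREATE INDEX orders_cust ON orders (customer, amount)");

        conn.close();
    }
}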

What the change does do is alter the cache contents and the I/O timing a
lot.

Also, I wanted to again make sure people understand what statistics mean
in Derby.  What is usually called "statistics" in other databases comes in
three forms in Derby:

1) key distribution information
   o this is automatically maintained by Derby by using the existing
     indexes; no user interaction is ever needed and it is always up to
     date.  Basically we use the index itself to estimate the number of
     rows for a range of keys.  Most other databases I know of require
     some sort of "update statistics" to get this info.  (A small example
     of looking at this estimate follows below.)
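
A quick way to see this estimate at work (my own illustration, with
made-up table, column and database names) is to turn on runtime statistics
and look at the estimated row count the optimizer derived from the index
for a range predicate:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowRangeEstimate {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:derby:testdb");
        Statement s = conn.createStatement();

        // Collect plan/estimate information for subsequent statements
        // on this connection.
        s.execute("CALL SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1)");

        // A range predicate on an indexed column; names are placeholders.
        ResultSet rs = s.executeQuery(
                "SELECT * FROM orders WHERE customer BETWEEN 10 AND 20");
        while (rs.next()) { /* drain the result */ }

        // The textual plan of the previous statement includes the
        // optimizer's estimated row count for the range.
        rs = s.executeQuery(
                "VALUES SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS()");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        conn.close();
    }
}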

2) base table row counts
   o also automatically maintained by Derby, but can be slightly out
     of date as it is only an estimate for performance reasons.  Row
     counts are updated at insert/delete time on a per page basis
     automatically, but we delay the rollup.  The rollup is done
     always when the page goes to disk.  The rollup may also be done
     earlier if the change is significant to the total - I think if
     delta is 10% or more of the table.  The row count estimate is also
     updated if the system ever happens to do a full scan on the
     table.  (A rough sketch of this rollup idea follows below.)
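
This is not Derby's actual code, but the delayed rollup roughly amounts to
the following sketch: each cached page keeps a local delta, and the delta
is folded into the table-level estimate when the page is written to disk,
or earlier if it exceeds roughly 10% of the current estimate.  All class
and method names here are mine, purely for illustration:

// Illustrative only - not Derby's implementation, just the idea of
// per-page row count deltas that are rolled up lazily.
public class RowCountEstimate {

    private long tableEstimate;   // current table-level row count estimate

    public RowCountEstimate(long initialEstimate) {
        this.tableEstimate = initialEstimate;
    }

    /** Called per page as rows are inserted into or deleted from it. */
    public void recordPageDelta(Page page, int delta) {
        page.pendingDelta += delta;
        // Roll up early if the pending change is significant relative to
        // the whole table (roughly the 10% rule mentioned above).
        if (Math.abs(page.pendingDelta) * 10 >= Math.max(tableEstimate, 1)) {
            rollUp(page);
        }
    }

    /** Always called when the page goes to disk. */
    public void onPageFlush(Page page) {
        rollUp(page);
    }

    /** A full scan sees the real count, so take it as the new estimate. */
    public void onFullScan(long exactRowCount) {
        tableEstimate = exactRowCount;
    }

    private void rollUp(Page page) {
        tableEstimate += page.pendingDelta;
        page.pendingDelta = 0;
    }

    public long estimate() {
        return tableEstimate;
    }

    /** Minimal stand-in for a cached page. */
    public static class Page {
        int pendingDelta;
    }
}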

3) cardinality statistic
   This is basically one number per leading set of columns in an index:
   the average number of duplicates for a given key, where the key is a
   leading prefix of the index columns.  So a single-column index has one
   number, and an index on three columns (a, b, c) has three numbers
   indicating the cardinality of (a), (a, b) and (a, b, c).  (The query
   sketched at the end of this note shows where these numbers are stored.)
   o This statistic is automatically created when an index is created,
     and also when a number of "bait/switch" DDL operations are done,
     like compress table.  There have been a number of discussions on the
     list on how to automate the update of this.  My opinion is that we
     should attack it in the following three ways:
   o Provide a way for users to schedule it to be calculated, other than
     by running compress.  Not great for zero-admin - just a workaround
     until we can better automate it.  The code exists; it is just not
     exposed to users right now.

   o Develop a background "zero-admin" component which would
     programmatically figure out how and when to schedule this kind of
     activity.

   o See if there is any smart, quick way to estimate the values
     automatically by doing some statistical analysis of the btree leaf
     pages.
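
For anyone who wants to see these cardinality rows, they live in the
SYS.SYSSTATISTICS system catalog; the query below is my own sketch (run
against a placeholder database) and shows one row per leading-column
prefix of each index, so an index on (a, b, c) shows up with COLCOUNT 1,
2 and 3:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ShowCardinalityStats {
    public static void main(String[] args) throws Exception {
        Connection conn = DriverManager.getConnection("jdbc:derby:testdb");
        Statement s = conn.createStatement();

        // One SYSSTATISTICS row per leading column prefix of each index;
        // the STATISTICS column (not selected here) holds the actual
        // numUnique/numRows values for that prefix.
        ResultSet rs = s.executeQuery(
                "SELECT t.TABLENAME, c.CONGLOMERATENAME, st.COLCOUNT " +
                "FROM SYS.SYSSTATISTICS st " +
                "JOIN SYS.SYSCONGLOMERATES c " +
                "  ON st.REFERENCEID = c.CONGLOMERATEID " +
                "JOIN SYS.SYSTABLES t ON st.TABLEID = t.TABLEID " +
                "ORDER BY t.TABLENAME, c.CONGLOMERATENAME, st.COLCOUNT");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "  " + rs.getString(2)
                    + "  colcount=" + rs.getInt(3));
        }
        conn.close();
    }
}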
