--- [EMAIL PROTECTED] wrote:
> The problem is when inserting into large database that is
> indexed, the values being indexed are randomly distributed.
> So with each insert, SQLite has to seek to a new random
> place in the file to insert the new index entry there.
> It does not matter that pages of the index are not in
> consecutive order.

For bulk INSERTs I see your point.

But SELECT performance on the resulting database does improve when the pages
of each index are contiguous, at least when the OS file cache is cold (and you
do not have the luxury of running VACUUM beforehand) or when the database file
is many times the size of RAM. It would be nice if contiguous index pages
could be a fortunate "by-product" of the bulk INSERT optimization algorithm,
for that reason.

>  What matters is that each insertion
> is into a different place and that the places are randomly
> distributed over a large file - larger than will fit in cache.
> In this situation, each new entry probably needs to go on 
> a page that is not in cache, and hence a real disk
> seek and read is required (as opposed to simply reading
> the value out of cache) when inserting each new entry.  Disk
> seeks and reads are much, much slower than disk cache hits,
> which really slows down insertion performance.
> 
> If you do many inserts such that the indexed values are
> in sorted order, then most inserts will go on the same page
> as the previous.  The previous page is already in cache
> so it doesn't need to be read from disk.  It has also
> already been journaled, so no excess writing is required.
> Disk I/O is greatly reduced. Things go much, much faster.
> The problem is that you rarely have the luxury of being
> able to insert entries in sorted order.  And if you are
> indexing multiple columns, it is impossible to sort
> the entries on both columns at once.
> 
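
When the rows to be added already exist somewhere unindexed, the sorted-order
case can at least be approximated from plain SQL by feeding them to the
indexed table in key order. A sketch, with invented table and column names
(and it only helps the one index on "key"):

  INSERT INTO indexed_tbl(key, payload)
    SELECT key, payload FROM staging ORDER BY key;
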
> The usual way to work around this problem is to only do random
> inserts into indices which are small enough to fit into
> your disk cache.  Suppose you start inserting into index A.
> Once A gets too big to fit entirely in cache, stop inserting
> into it and start inserting into B.  Once B gets to be the
> cache size, merge A and B together into a new index C.
> The merge operation requires reading A and B straight through
> from beginning to end once.  This is a lot of disk I/O but
> it is still much faster than jumping around within the file
> reading bits here and there.  After creating C, reset A and
> B back to being empty (since all records have been transferred
> into C).  Start inserting into A again until it fills up.
> Then fill up B again.  Merge A and B into D, then merge C and D
> into E.  Reset A, B, C, and D.  Keep doing this, merging
> smaller indices into larger indices, until you insert all
> the records you need to insert.  Then make a single pass
> through all of the smaller indices and merge them all
> together into a single big index Z.  Z becomes the new
> index used for search operations on the database.
> 
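
If I am reading the staged-merge scheme right, its shape can be roughly
emulated from plain SQL with throwaway staging tables standing in for the
small indices (all names below are invented; this is only a sketch of the
idea, not the in-B-tree implementation being proposed):

  -- A and B are each kept small enough to stay in cache while being filled
  CREATE TABLE stage_a(key, ref);
  CREATE TABLE stage_b(key, ref);
  CREATE INDEX stage_a_key ON stage_a(key);
  CREATE INDEX stage_b_key ON stage_b(key);

  -- ... fill stage_a up to the cache-size threshold, then fill stage_b ...

  -- merge A and B into C with one sequential pass over each
  CREATE TABLE stage_c(key, ref);
  INSERT INTO stage_c(key, ref)
    SELECT key, ref FROM stage_a
    UNION ALL
    SELECT key, ref FROM stage_b
    ORDER BY key;
  CREATE INDEX stage_c_key ON stage_c(key);

  -- reset A and B and repeat, always merging smaller stages into larger ones
  DELETE FROM stage_a;
  DELETE FROM stage_b;
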
> The algorithm above should probably go into SQLite 
> to construct an index as part of the CREATE INDEX
> statement.  When populating a database with a large
> amount of data, first put all of the data into an
> unindexed table.  This will be very fast because each
> new entry goes at the end of the table (and also at the
> end of the file.)  After all data is in place, issue
> the CREATE INDEX statements to index any fields you
> think need indexing.  The CREATE INDEX statement has
> to populate the index.  The current algorithm is to
> scan through the original table and create and insert
> index entries one by one as they are encountered.  I am
> proposing that you substitute the algorithm outlined in
> the previous paragraph in place of the current algorithm.
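
For what it is worth, the "load first, index afterwards" pattern is already
available from SQL today (concrete names below are only for illustration);
the proposal would mainly make the final CREATE INDEX steps much faster:

  CREATE TABLE t(a, b, c);             -- no indexes yet
  BEGIN;
  INSERT INTO t VALUES(1, 'x', 'y');   -- each new row is appended at the end
  -- ... a few million more INSERTs ...
  COMMIT;
  CREATE INDEX t_a ON t(a);            -- build the indexes after the load
  CREATE INDEX t_b ON t(b);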

Are you assuming that you do not have any indexes already in place on 
the table being populated?

It would be nice if the optimization could work in the general case
where the table already has data and indexes (implicit or explicit) 
as in:

  CREATE TABLE foo (
    a UNIQUE,
    b UNIQUE,
    c,
    d,
    PRIMARY KEY(d, b, c)
  );

Or are you suggesting that SQLite first determine whether the INSERTs are
running within a larger transaction, apply the new table/index population
algorithm to each INSERT, and only combine the various sub-indexes at the
end of the transaction or whenever a page threshold is exceeded?

Guessing the end of the INSERTs into a given table could be tricky within a
larger transaction. The end might be defined as the point where an SQL
command other than an INSERT into the same table is executed, as in:

  BEGIN;
  INSERT INTO foo VALUES ...;
    -- detect that we've started inserting into table foo
    -- and begin the bulk insert optimization
  INSERT INTO foo VALUES ...;
  ...a million other inserts into foo...
  INSERT INTO foo VALUES ...;
  INSERT INTO foo VALUES ...;
  INSERT INTO foo VALUES ...;
    -- detect that we've stopped inserting into foo and combine the
    -- various sub-indexes ?
  INSERT INTO another VALUES ...; 
  ... unrelated SQL commands ...
  END;




      