On Thu, 30 Mar 2006, Christian Smith wrote:

> On Wed, 29 Mar 2006, Andy Spencer wrote:
> 
> >I have a sqlite database with about 3 GB of data, most of which is stored
> >in a data table with about 75 million records, having three columns
> >(EntryId INTEGER, PropertyId INTEGER, Value NUMERIC) and
> >PRIMARY KEY(EntryId, PropertyId).
> >
> >This table is not indexed, to allow faster updates.
> 
> 
> It is indexed. The primary key clause creates an implied index on
> (EntryId,PropertyId).
> 

You are right. I verified this with:

SQL statement: PRAGMA index_list(Data)
(seq) = (0)
(name) = (sqlite_autoindex_Data_1)
(unique) = (1)

SQL statement: PRAGMA index_info(sqlite_autoindex_Data_1)
(seqno) = (0)
(cid) = (0)
(name) = (EntryId)

(seqno) = (1)
(cid) = (1)
(name) = (PropertyId)


> 
> >
> >The problem is that it takes over an hour to access all Values, for a
> >specified PropertyId, when the value is obtained for each EntryId
> >separately (using "SELECT Value FROM Data WHERE PropertyId=? AND
> >EntryId=?", bound to the specified PropertyId and EntryId) and
> >the EntryId values for successive database queries are in essentially
> >random order (taken from an external list of entries that has been
> >sorted by property values).
> >
> >This same query (getting the property value for each EntryId,
> >separately) only takes about 7 minutes when the EntryId values for
> >successive database queries are in the same ascending order as
> >the data originally inserted into the table.
> 
> 
> Yes. You're accessing the database in about as inefficient a way as
> possible for your data, resulting in much thrashing of caches. Under
> UNIX, if you're thrashing the OS cache, you can monitor this using vmstat.
> 
> 
> >
> >I assume that this has to do with better pager caching of successive
> >records in the database, whereas random access may re-read the same
> >page multiple times (due to the limited cache).
> 
> 
> If you're not thrashing the OS cache (do you have lots of RAM?) try
> increasing the size of your SQLite cache. Use:
> 
> PRAGMA cache_size=20000;
> 
> This will make your cache 10x bigger, and may increase hit rate.
> 

SQL statement: PRAGMA page_size
(page_size) = (1024)

SQL statement: PRAGMA default_cache_size
(cache_size) = (2000)

SQL statement: PRAGMA cache_size
(cache_size) = (2000)

So, our cached data is currently 2000 pages * 1024 bytes/page = 2 MB.

I think you are right, that it makes sense to increase the number of
cached pages, rather than the page size.
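For reference, here is a minimal sketch of bumping the page cache from
Python's sqlite3 module (an in-memory database stands in for the real
3 GB file here):

```python
import sqlite3

# ":memory:" stands in for the real 3 GB database file.
conn = sqlite3.connect(":memory:")

# The default cache is 2000 pages; at 1024 bytes/page that is ~2 MB.
# cache_size applies per connection and is not persisted in the file,
# so it must be issued each time the database is opened.
conn.execute("PRAGMA cache_size=20000")  # ~20 MB of page cache

print(conn.execute("PRAGMA cache_size").fetchone()[0])  # 20000
conn.close()
```

(PRAGMA default_cache_size, by contrast, is stored in the database file
and sets the default for future connections.)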

> 
> >
> >My question is whether it should be faster to
> >
> >A) create an index for the table before the query,
> >   query the value (for the specified PropertyId) for each EntryId
> >   (in essentially random order, from external list of entries),
> >   and delete the index after the queries (for each EntryId) are done
> 
> 
> Won't help. You already have an index from the primary key.
> 

Yes. I tried adding an index to the table, after the data had been
imported and prior to fetching the entry property values, and the
construction of the index took longer than it had taken previously to
fetch all of the property values.

> 
> >
> >or
> >
> >B) issue a single "SELECT EntryId, Value FROM Data WHERE PropertyId=?" query
> >   (bound to the specified PropertyId) and step through the results,
> >   using something like a hash table lookup to map the EntryId values
> >   (returned from the query) back to an index into the external list of
> >   entries.
> 
> 
> This may help, as you'll not be using the primary key index, and thus the
> index pages will not be competing with the table pages for memory.
> 
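For what it's worth, option B could look something like the following
sketch (Python's sqlite3 module; the table schema is the one above, but
the sample rows and entry list are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the real database
conn.execute("CREATE TABLE Data (EntryId INTEGER, PropertyId INTEGER, "
             "Value NUMERIC, PRIMARY KEY(EntryId, PropertyId))")
conn.executemany("INSERT INTO Data VALUES (?,?,?)",
                 [(1, 7, 0.5), (2, 7, 1.5), (3, 7, 2.5), (2, 8, 9.0)])

# External list of entries, sorted by property value (effectively random
# order from the database's point of view).
entry_ids = [3, 1, 2]

# Map EntryId -> position in the external list, so each fetched row can
# be placed directly, replacing 75 million point lookups with one scan.
position = {eid: i for i, eid in enumerate(entry_ids)}

values = [None] * len(entry_ids)
for entry_id, value in conn.execute(
        "SELECT EntryId, Value FROM Data WHERE PropertyId=?", (7,)):
    if entry_id in position:
        values[position[entry_id]] = value

print(values)  # [2.5, 0.5, 1.5]
conn.close()
```

Since PropertyId is not a leading column of the primary key, this SELECT
is a sequential table scan, which is exactly what avoids competing with
the index pages for cache.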
> 
> >
> >The values extracted from the database are to be copied into an entry
> >property data structure, having the same order as the external list of
> >entries.
> >
> 
> If you must group the values by PropertyId rather than EntryId, then
> insert them into the database in that order. Is that possible?
> 

During import, the data arrives entry by entry, with each entry carrying
multiple property values (possibly for the same properties as previous
entries, possibly for different ones).

It is not feasible to group the incoming data by PropertyId, and new values
may be added (or replace existing values) in the table at any time.
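The add-or-replace part of the import is handled by the composite primary
key itself. A sketch of what I mean (sqlite3 from Python; the helper name
and sample data are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stands in for the real database
conn.execute("CREATE TABLE Data (EntryId INTEGER, PropertyId INTEGER, "
             "Value NUMERIC, PRIMARY KEY(EntryId, PropertyId))")

def import_entry(entry_id, properties):
    # INSERT OR REPLACE keys on the (EntryId, PropertyId) primary key:
    # it overwrites an existing value or adds a new row in one statement.
    conn.executemany(
        "INSERT OR REPLACE INTO Data VALUES (?,?,?)",
        [(entry_id, pid, val) for pid, val in properties])

import_entry(1, [(7, 0.5), (8, 9.0)])
import_entry(1, [(7, 0.75)])  # replaces the earlier value for property 7

print(conn.execute(
    "SELECT Value FROM Data WHERE EntryId=1 AND PropertyId=7")
    .fetchone()[0])  # 0.75
conn.close()
```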

> That, or increase the amount of RAM you have.
> 
> Christian

Increasing RAM is probably not an option for me or for many of the
customers who will be using the software.

Thank you for your suggestions.

----------------------
Andy Spencer/ Schrodinger, Inc./ [EMAIL PROTECTED] 

