Sorry folks, it took me time (and a brand new brain) to figure out how I 
could post a reply that would appear on the mailing list (I'm new to 
that kind of sport).

To Roger Binns:

> As another data point, my data set is 10M records and the 6 indices I 
> need
> are created in under two minutes.  (I also analyze.)
Wow, impressive! On my side, I am talking about 2-3 hours per table (and I need several of them).

> The database is 2GB
> but the least amount of memory I have is 4GB.  I also use a 32kb page 
> size,
> larger cache etc.
I have of course tried increasing the cache size and the database page size (up to 32K), but the effect was not really measurable. I therefore (maybe too quickly) classified these parameters as having little leverage on my particular problem, although I admit I had great hopes about their impact on indexing performance. This lack of observable effect remains a mystery to me. The only clue I have is that the Python SQLite wrapper (my package is Python-based) might not be correctly forwarding the "PRAGMA xxx" commands to the SQLite engine.
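For what it's worth, the simplest check I know of is to read the PRAGMAs back right after setting them; if the wrapper swallowed them, the engine defaults show up instead. A minimal sketch using the standard sqlite3 module (the file name and values are placeholders, not my real settings):

import sqlite3

conn = sqlite3.connect("linkage.db")  # placeholder file name

# page_size only takes effect on a database that has no tables yet (or after
# a VACUUM); cache_size is counted in pages of that size.
conn.execute("PRAGMA page_size = 32768")
conn.execute("PRAGMA cache_size = 20000")

# Read the values back: if the wrapper silently dropped the PRAGMAs, the
# engine defaults would be printed here instead of the values set above.
print(conn.execute("PRAGMA page_size").fetchone()[0])
print(conn.execute("PRAGMA cache_size").fetchone()[0])

conn.close()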

> Surely the index generation is a one time thing being done at upgrade 
> time.
At table population time, actually. Afterwards the table is used exclusively in read-only mode.

> Couldn't that be done late at night so taking several hours would be ok?
Well, yes and no. This software does not run on a server; it is operated by some of my collaborators on their office workstations. They cannot always predict when they will need to perform a linkage task, and when the need suddenly arises they should not be forced to wait until the next workday to have a usable database. Besides, I find the situation inelegant and want my software to be efficient even for large datasets.


To Max Vlasov:

> So there was no way to ignore some external-files approach, and I did it by
> filling a memory-mapped file with the contents of the Text field, while also
> filling an array in memory with the offsets and lengths of the strings.
> After that, a quicksort of that offset array took about 5 minutes, and
> inserting the texts in sorted order into the sqlite database took the other
> 5 minutes, so about 10 minutes in total. The first 5 minutes was possible
> since we exchange only offsets, not data, and the other 5 minutes because
> inserting sorted data into a B-tree is a really fast operation.
>
> Although real-life data can be different, the things that worked might be
> the same. So anyone can use this method, occupying no more than the sqlite
> file itself for temporary storage and ending up with the data in the
> necessary order inside the sqlite database. I know that there are many
> things to take into account, like the memory size and the size of the
> actual data, but it's just a proof of concept.
>
> Also, I think sqlite could use the same approach internally for creating an
> index on existing data. The db is probably already exclusively locked while
> CREATE INDEX is in progress, so having a temporary array accessing and
> storing, for example, file offsets of particular records should not be a
> problem.
Nice solution (of a type I have already fiddled around with, actually, as you can imagine). This variant still poses 2 problems for me:

1) It is RAM-limited, and therefore not necessarily robust to an increase in dataset size beyond a certain limit that I am already close to (Win32 processes are still capped at 2GB each, unfortunately).

2) I need to create 2 indices on 2 different columns whose contents are totally uncorrelated with respect to sort order. Your solution would nicely reduce the indexing time for the 1st column, but what about the 2nd one? (See the sketch below.)
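For completeness, here is roughly how I picture your scheme in Python terms (file, table and column names are made up, and a single-byte text encoding is assumed just for the sketch):

import mmap
import sqlite3

# Sketch of the offset-sort idea: the flat file is assumed to hold one text
# value per line; only (offset, length) pairs are kept and sorted in RAM,
# then the strings are inserted into sqlite in sorted order.
f = open("fingerprints.txt", "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

spans = []
pos = 0
while True:
    end = mm.find(b"\n", pos)
    if end == -1:
        break
    spans.append((pos, end - pos))
    pos = end + 1

# The sort moves only (offset, length) tuples around; the key function reads
# the mapped bytes on demand.
spans.sort(key=lambda s: mm[s[0]:s[0] + s[1]])

conn = sqlite3.connect("linkage.db")
conn.execute("CREATE TABLE IF NOT EXISTS fp(value TEXT)")
conn.executemany(
    "INSERT INTO fp(value) VALUES (?)",
    ((mm[o:o + l].decode("latin-1"),) for o, l in spans),
)
conn.commit()
conn.close()
mm.close()
f.close()

My second problem stands, though: building the index on the second column would require re-sorting the same span array on a completely different key, so nothing is gained on that side.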


To Jay Kreibich:

> If you haven't already try increasing the default cache size by 100x or
> higher.  Just be sure you stay in physical memory.
I have, but without apparent effect (see my reply to Roger Binns).

> Yes... Virtual Tables will end up doing flat table-scans in most cases
> (unless you get very complex with the code) but is often extremely fast,
> especially if you're able to mmap the file.  I've used them with
> great success for many "read-only" type tables where the typical
> access pattern is a scan and aggregation.  Log files, for example.
> Although, if you're talking about creating multiple indexes, I assume
> that isn't going to work in this situation.
Right. There would also be another problem for me with virtual tables: since I use SQLite through Python, I would have to migrate my code to a Python wrapper that sits closer to SQLite's native API, such as APSW. Such a wrapper is not DB-API compliant and would break code genericity (my system is architected to connect to various database engines, and the price to pay is to remain DB-API compliant).


*  *  *

It seems that my post has spawned some debate that sometimes rests on false assumptions about the nature of what I am doing, so let me restate it briefly:
1) An elementary "duty cycle" in my system consists of creating 2 sqlite databases from 2 flat files (= user datasets), each containing up to 25 million records.
2) In addition to one table holding all records from the flat file (after normalization), each database must be supplemented with auxiliary tables that I populate with record "fingerprints" (read: digests) of various kinds.
3) Each of these auxiliary tables must be indexed on its "record id" column as well as on its "fingerprint" column (see the sketch below).
My performance problem occurs mainly at step 3.
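To make step 3 concrete, here is roughly what one auxiliary table and its two indices look like (table and column names are simplified placeholders, not my actual schema):

import sqlite3

conn = sqlite3.connect("dataset_a.db")  # placeholder name

# One auxiliary table per fingerprint kind (simplified placeholder schema).
conn.execute("""
    CREATE TABLE IF NOT EXISTS fingerprints(
        record_id   INTEGER,  -- id of the record in the main table
        fingerprint TEXT      -- digest computed from that record
    )
""")

# ... bulk INSERTs of up to ~25 million rows happen here ...

# Step 3: the two index creations that dominate the run time.
conn.execute("CREATE INDEX IF NOT EXISTS fp_by_record ON fingerprints(record_id)")
conn.execute("CREATE INDEX IF NOT EXISTS fp_by_digest ON fingerprints(fingerprint)")
conn.commit()
conn.close()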


Thanks to all for sharing your thoughts & suggestions so far.