Re: [sqlite] Prohibitive indexing time with large tables

2010-02-16 Thread Max Vlasov
Hello, Jérôme

Nice to hear you finally joined us in this really interesting discussion )


>
> To Max Vlasov:
>
> > in sorted order into the sqlite database took another 5 minutes, so about
> > 10 minutes in total. The first 5 minutes were possible because we exchange
> > only offsets, not data, and the other 5 minutes because inserting sorted
> > data into a B-tree is really a fast operation.
> .
> Nice solution (of the type I have already fiddled around with, actually, as
> you can imagine).
> This variant still poses 2 problems for me:
>
> 1) Its workability is RAM-limited, and therefore not necessarily robust
> to an
> increase in dataset size beyond a certain limit I am already close to
> (Win32-based
> processes are still bound to max. 2GB/process, unfortunately).
>
> 2) I need to create 2 indices on 2 different columns whose contents are
> totally uncorrelated with respect to sort order. Your solution would nicely
> reduce the indexing time of the 1st column, but what about the 2nd one?
>
>
>
You raised real problems, and when I tried to run my test on a system with
less RAM, the results confirmed these observations. But at least we found a
way to increase the speed for some datasets and some hardware configurations.
Maybe some other approaches can improve the solution. The suggestion from
Ibrahim about using a RAM drive, for example, was interesting; I'd also
mention using different hard drives together with a merge sort, but all these
solutions break the beauty of sqlite imho, and as a consequence its
flexibility.

But the second problem is really hard to solve; that's where sqlite could
internally take advantage of low-level data access, but I doubt this is an
easy task. I suppose any special sorting done with direct file access could
even break the beauty of the vdbe, not to mention the danger of changing the
code significantly.

By the way, you didn't mention the size of your "fingerprints". Can you
calculate the average index record size, or the total index size, for your
20M records?
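
To make the question concrete, here is the kind of back-of-the-envelope
estimate I mean, as a small Python sketch (the 40-byte fingerprint and the
per-entry overhead are assumed numbers, not Jérôme's actual figures):

# Rough estimate of the index size and of the cache needed to hold it.
rows = 20_000_000
fingerprint_bytes = 40      # assumed average key size
rowid_bytes = 8             # rowid stored with every index entry
entry_overhead = 10         # rough allowance for cell/record headers

index_bytes = rows * (fingerprint_bytes + rowid_bytes + entry_overhead)
page_size = 1024            # SQLite's historical default page size

print(f"estimated index size  : {index_bytes / 2**20:.0f} MiB")
print(f"cache pages to hold it: {index_bytes // page_size:,}")

With these assumed numbers the index alone is over 1 GB, which already
exceeds the extreme cache sizes tried elsewhere in this thread.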

Max


Re: [sqlite] Prohibitive indexing time with large tables

2010-02-16 Thread Max Vlasov
>   I'm really surprised you're not seeing something from an increased
>  cache size.  It has always made a very noticeable difference in my own
>  manipulations, but I think the largest table I've ever worked with
>  "only" had 6M rows.
>


Jay, increasing the cache size really helps, but only up to a certain size,
as I mentioned. And it makes sense. I can guess what is going on: while we
are not out of cache, all of this B-tree-structured data sits entirely in RAM
and can be updated very quickly. But when the time for flushing comes, we
have many pages forming this B-tree, and the question is which pages should
be flushed to free that part of the cache so that the next 1,000,000 records
(with absolutely unpredictable content) can use the tree more effectively.
Honestly, I'm not aware of such an algorithm. So every subsequent mass insert
produces many "collisions": moments when we need to write to some page that
is not in the write cache.

But I guess there is something sqlite could do to use the existing cache more
effectively. I don't know the internals very well, but it seems that both a
read cache and a write cache exist in sqlite. The read cache is sometimes
almost unnecessary, since the data is already present in the system file
cache. In that case, accessing a page in the cache and accessing it through
the xRead interface of the VFS will probably take the same or similar time
(provided there is no encryption or other VFS overhead). But I'm not sure
about this guess, since it's sometimes hard to separate the read and write
caches, and also, for B-tree inserts, most pages that have just been read
will probably be written soon.

By the way, Alexey Pechnikov's recent post about his tests shows that the
page cache size does not always help and that you should take into account
not only the number of rows but also the total index size (quoting from
http://geomapx.blogspot.com/2009/11/degradation-of-indexing-speed.html: "As
we can see the index creating speed degradation is result of index size more
than SQLite page cache size.").

Max


Re: [sqlite] Prohibitive indexing time with large tables

2010-02-15 Thread Alexey Pechnikov
Hello!

See tests and some interpretation:
http://geomapx.blogspot.com/2009/11/degradation-of-indexing-speed.html

Best regards, Alexey Pechnikov.
http://pechnikov.tel/


Re: [sqlite] Prohibitive indexing time with large tables

2010-02-15 Thread Jérôme Magnin
Jay A. Kreibich wrote:

> You said you're creating two databases.  Are you doing those one at a
> time or ATTACHing the other databases?

Doing them one at a time, indeed.




Re: [sqlite] Prohibitive indexing time with large tables

2010-02-15 Thread Jay A. Kreibich
On Mon, Feb 15, 2010 at 08:09:26PM +0100, Jérôme Magnin scratched on the wall:

> > The database is 2GB
> > but the least amount of memory I have is 4GB.  I also use a 32kb page 
> > size,
> > larger cache etc.
> I have of course tried increasing the cache size and the database page size
> (up to 32K), but the effect was not really measurable. I therefore (maybe
> too quickly) classified such parameters as having little effect on my
> particular problem, although I admit I also had great hopes about how they
> would impact indexing performance. This lack of observable effect remains a
> mystery to me. The only clue could be that the Python SQLite wrapper (my
> package is Python-based) does not actually forward the "PRAGMA xxx" commands
> correctly to the SQLite engine.

  You said you're creating two databases.  Are you doing those one at a
  time or ATTACHing the other databases?

  The reason I ask is because each database has its own cache, so you
  need to be sure you bump up the cache size on the database that has
  the index in it.  If that isn't the "main" database, you need to use
  the format "PRAGMA database.cache_size = value", or you're changing
  the wrong cache.
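
  A minimal sketch of that, using Python's sqlite3 module (the file, table,
  and column names here are made up for the example):

import sqlite3

con = sqlite3.connect("main.db")
con.execute("ATTACH DATABASE 'aux.db' AS aux")
con.execute("CREATE TABLE IF NOT EXISTS aux.fingerprints (fingerprint TEXT)")

# Each database (main and every ATTACHed one) has its own page cache,
# so the increased cache must be applied to the schema holding the index.
con.execute("PRAGMA main.cache_size = 20000")
con.execute("PRAGMA aux.cache_size = 500000")   # the index is built in aux.db

con.execute("CREATE INDEX IF NOT EXISTS aux.idx_fp ON fingerprints(fingerprint)")
con.close()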

  I'm really surprised you're not seeing something from an increased
  cache size.  It has always made a very noticeable difference in my own
  manipulations, but I think the largest table I've ever worked with
  "only" had 6M rows.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Our opponent is an alien starship packed with atomic bombs.  We have
 a protractor."   "I'll go home and see if I can scrounge up a ruler
 and a piece of string."  --from Anathem by Neal Stephenson


Re: [sqlite] Prohibitive indexing time with large tables

2010-02-15 Thread Jérôme Magnin
Sorry folks, it took me some time (and a brand new brain) to figure out how I
could post a reply that would appear on the mailing list (I'm new to this
kind of sport).

To Roger Binns:

> As another data point, my data set is 10M records and the 6 indices I 
> need
> are created in under two minutes.  (I also analyze.)
Wow, impressive! On my side, I am talking about 2-3 hours per table
(and I need several of them).

> The database is 2GB
> but the least amount of memory I have is 4GB.  I also use a 32kb page 
> size,
> larger cache etc.
I have of course tried increasing the cache size and the database page size
(up to 32K), but the effect was not really measurable. I therefore (maybe too
quickly) classified such parameters as having little effect on my particular
problem, although I admit I also had great hopes about how they would impact
indexing performance. This lack of observable effect remains a mystery to me.
The only clue could be that the Python SQLite wrapper (my package is
Python-based) does not actually forward the "PRAGMA xxx" commands correctly
to the SQLite engine.
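
One quick way to rule out the wrapper is to read the pragma values back
through the same connection; if the engine accepted them, the round trip
shows it. A sketch with the standard sqlite3 module (the file name is
hypothetical):

import sqlite3

con = sqlite3.connect("linkage.db")

# page_size only takes effect on a fresh database (or after VACUUM);
# cache_size takes effect immediately and applies per connection.
con.execute("PRAGMA page_size = 32768")
con.execute("PRAGMA cache_size = 200000")

print(con.execute("PRAGMA page_size").fetchone())   # (32768,) if it was accepted
print(con.execute("PRAGMA cache_size").fetchone())  # (200000,)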

> Surely the index generation is a one time thing being done at upgrade 
> time.
At table population time, actually. The table is then further used 
exclusively in read-only mode.

> Couldn't that be done late at night so taking several hours would be ok?
Well, yes and no. This software does not run on a server but is operated 
by some of my
collaborators on their office workstations. They cannot always predict when
they will need to perform a linkage task, and when the situation suddenly
occurs they should not be forced to wait until the next workday to have a
usable database at their disposal. Besides, I really find
the situation inelegant and want my software to be efficient even for 
large datasets.


To Max Vlasov:

> So there was no way to avoid some external-files approach, and I did it by
> filling a memory-mapped file with the contents of the Text field, while also
> filling an in-memory array with the offsets and lengths of the strings.
> After that, quicksorting that offset array took about 5 minutes and
> inserting the texts in sorted order into the sqlite database took another 5
> minutes, so about 10 minutes in total. The first 5 minutes were possible
> because we exchange only offsets, not data, and the other 5 minutes because
> inserting sorted data into a B-tree is really a fast operation.
>
> Although real-life data can be different, the things that worked might be
> the same. So anyone can use this method, occupying no more than the size of
> the sqlite file itself for temporary storage, and ending up with the data in
> the necessary order inside the sqlite database afterwards. I know that there
> are many things to take into account, like the memory size and the size of
> the actual data, but it's just a proof of concept.
>
> Also, I think sqlite could use the same approach internally when creating an
> index on existing data. The db is probably already exclusively locked while
> CREATE INDEX is in progress, so keeping a temporary array that accesses and
> stores, for example, the file offsets of particular records should not be a
> problem.
Nice solution (of the type I have already fiddled around with, actually, as
you can imagine).
This variant still poses 2 problems for me:

1) Its workability is RAM-limited, and therefore not necessarily robust 
to an
increase in dataset size beyond a certain limit I am already close to 
(Win32-based
processes are still bound to max. 2GB/process, unfortunately).

2) I need to create 2 indices on 2 different columns whose contents are
totally uncorrelated with respect to sort order. Your solution would nicely
reduce the indexing time of the 1st column, but what about the 2nd one?


To Jay Kreibich:

> If you haven't already try increasing the default cache size by 100x or
> higher.  Just be sure you stay in physical memory.
I have, but without apparent effect (see my reply to Roger Binns).

> Yes... Virtual Tables will end up doing flat table-scans in most cases
> (unless you get very complex with the code) but is often extremely fast,
> especially if you're able to mmap the file.  I've used them with
> great success for many "read-only" type tables where the typical
> access pattern is a scan and aggregation.  Log files, for example.
> Although, if you're talking about creating multiple indexes, I assume
> that isn't going to work in this situation.
Right. There would also be another problem for me in using virtual tables:
since I am using SQLite through Python, in order to use virtual tables I
would have to migrate my code to a Python wrapper that is closer to SQLite's
native API, such as APSW. Such a wrapper is not DB-API compliant and would
break the genericity of my code (my system is architected to be able to
connect to various database engines, but the price to pay is remaining
DB-API compliant).


*  *  *

It seems that my post has spawned some debates that sometimes rely on false
assumptions about the nature of what I am

Re: [sqlite] Prohibitive indexing time with large tables

2010-02-15 Thread Hynes, Tom
> His original question was about the todo list found at the wiki.
> Not sure if any of the core developer will answer, but I would
> be interested as well...

I too have long been interested in when this might be addressed, since we see
similar performance drop-offs with large numbers of rows.





Re: [sqlite] Prohibitive indexing time with large tables

2010-02-14 Thread Ibrahim A
On 14.02.2010 18:53, Max Vlasov wrote:
>> This is appx. a 500MB cache; why not try a cache size of 2,000,000? :-)
>>
> Hmm, I managed to increase it to only 1,000,000 (x1024) pages; larger values
> eventually lead to "Out of memory". This value (about 1 GB) allows up to
> 6,000,000 fast records for an index on a 100-byte field per record. Still
> good; such an extreme-cache method can work in many cases, I think.
>
>
For such problems I would suggest a different solution practiced by many
companies:

Simply buy a battery-buffered RAM drive. The price (for example, for the ASUS
drive) is about $250. For a governmental solution, $250 plus the RAM costs
won't really matter. The speed improvement (RAM access is far faster than HD
access) will make the cost negligible.

On the software side, simply inserting new index entries a few thousand at a
time into a completely new in-memory B-tree and then merging the existing
pages with the new ones would make special sorting or other techniques
unnecessary. It's the simplest and fastest solution.


Re: [sqlite] Prohibitive indexing time with large tables

2010-02-14 Thread Max Vlasov
> This is appx. a 500MB cache; why not try a cache size of 2,000,000? :-)
>
>
Hmm, I managed to increase it to only 1,000,000 (x1024) pages; larger values
eventually lead to "Out of memory". This value (about 1 GB) allows up to
6,000,000 fast records for an index on a 100-byte field per record. Still
good; such an extreme-cache method can work in many cases, I think.

Max


Re: [sqlite] Prohibitive indexing time with large tables

2010-02-14 Thread Marcus Grimm
> Marcus,
>
> although increasing cache size is a good method, it may sometimes give
> unpredictable results (in terms of performance).

hm... why?

> I looked at the vdbe code (EXPLAIN CREATE INDEX ... ) of the index
> creation
> and it seems like there is no special sorting algorithm (CMIIW please).

I guess you are right; it doesn't make sense to assume any sorting. Sorry.

> Excluding all the "execute once" setup lines from the vdbe code leaves this
> loop:
>
>   19  Rowid
>   20  Column
>   21  MakeRecord
>   22  IdxInsert
>   23  Next  -> 19
>
> which leads to the conclusion that either a single "indexed" insert or a
> step of index creation is actually just an insert of data into the B-tree
> without special preprocessing.

But to insert that index data, sqlite needs to at least partly compare
existing values with the one it is about to insert. This, I guess, is where
the cache comes into play: when sqlite is not able to hold the growing index
in memory and/or the entries to be inserted are unsorted, it has to free and
reload pages randomly. Anyway, I'd better stop guessing and leave this to
people who are more familiar with the sqlite internals... :-)

>
> So we are simply considering a mass insert into a B-tree. As I suppose, the
> cost of such an insert is sometimes low, sometimes high, depending on
> different factors including the occasional need for rebalancing. As long as
> all operations are in memory and all pages are "dirty" (not yet saved to
> disk), the results look very promising. I tried changing the cache size to
> 500,000 (x1024) for the same

This is appx. a 500MB cache; why not try a cache size of 2,000,000? :-)

> database (10M records, 100-byte record size); adding from scratch was very
> fast until about 3M records (the task manager showing about 800MB of memory
> used at that moment) and dropped significantly after that. Whether this
> breaking point is at 3M records or not depends on the size of the data used
> in the index. So as long as any search inside the tree is a walk through
> cached pages (cached either by sqlite or by the underlying file system), and
> any page writes involved in new tree operations go to dirty pages that will
> be written later, there is no problem. Otherwise there is. The other problem
> is that when the total data size exceeds the 32-bit address space, the cache
> stops giving a significant improvement.

Yes, I was assuming that the index would more or less fit into RAM. In my
experiments that appeared to be the case, but I had an index on the hash
value, not on the text column itself, thus reducing the memory required.

>
> I know the quicksort approach I suggested has disadvantages, but in some
> ways its results are more predictable. For example, we postpone any sqlite
> write operation until the data is ready. Unfortunately, for index data
> bigger than the 32-bit address space the file mapping doesn't work, but it
> can be changed to simple seek access, and at least the file system can do a
> decent job of keeping the frequently accessed data in its cache.
>
> Marcus, I didn't understand your comment about frequent reloading. As I read
> the initial post, he has a large chunk of unsorted data, 20M records, that
> needs to be processed on a daily basis. So in his case any approach that
> ends up with the data placed inside an sqlite db and indexed properly is
> fine. The problem is that, on a daily basis, a couple of hours is a big
> price to pay. Jerome can correct me, but he hasn't added anything new to
> this discussion yet; I hope he will.

His original question was about the todo list found on the wiki.
Not sure if any of the core developers will answer, but I would
be interested as well...

Anyway, thanks for discussing.

best

Marcus

>
> Nice approach with the hash and the collision resolution inside the query;
> I'll keep it in mind for future use )
>
> Max
>
> On Sun, Feb 14, 2010 at 4:03 PM, Marcus Grimm wrote:
>
>> Just for my curiosity:
>>
>> Have you tried increasing the cache as already suggested?
>>
>> I ran into a similar problem while playing with an artificial
>> test database with appx. 10 million records and creating an index.
>> Without drastically increasing the cache size, sqlite appears
>> unable to create an index on a certain field - it never finishes
>> within, say, 2 hours.
>> That sounds quite disappointing keeping in mind how fast sqlite usually
>> operates, but it becomes clear when we consider that sqlite
>> is told to use only a few MB of memory by default. Also, your
>> quicksort mentioned below will be very slow if it needs to
>> reload data from disk all the time.
>> So in my case it helped to tell sqlite to use appx. 500MB of memory
>> via pragma cache_size.
>>
>> Please note that if you create an index on a text field, sqlite
>> will basically make a copy of the whole table in your case.
>>
>> Depending on how you use that Text field in a filter statement,
>> you may consider adding an integer hash (e.g. CRC32) column to your
>> table, creating an index on that, and slightly changing your queries
>> like:
>> SELECT * 

Re: [sqlite] Prohibitive indexing time with large tables

2010-02-14 Thread Max Vlasov
Marcus,

although increasing cache size is a good method, it may sometimes give
unpredictable results (in terms of performance).
I looked at the vdbe code (EXPLAIN CREATE INDEX ... ) of the index creation
and it seems like there is no special sorting algorithm (CMIIW please).
Excluding all the "execute once" setup lines from the vdbe code leaves this
loop:

  19  Rowid
  20  Column
  21  MakeRecord
  22  IdxInsert
  23  Next  -> 19

which leads to the conclusion that either a single "indexed" insert or a step
of index creation is actually just an insert of data into the B-tree without
special preprocessing.
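
The opcode listing above comes from running EXPLAIN on the statement;
something like the following Python sketch reproduces it (EXPLAIN only
compiles the statement, it does not actually build the index; the database
file name and the new index name are assumptions):

import sqlite3

con = sqlite3.connect("test.db")   # a copy of the 10M-row test database

# EXPLAIN yields one row per VDBE opcode:
# (addr, opcode, p1, p2, p3, p4, p5, comment)
for addr, opcode, p1, p2, *rest in con.execute(
        "EXPLAIN CREATE INDEX idx_explain ON TestTable(Text)"):
    print(addr, opcode, p1, p2)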

So we are simply considering a mass insert into a B-tree. As I suppose, the
cost of such an insert is sometimes low, sometimes high, depending on
different factors including the occasional need for rebalancing. As long as
all operations are in memory and all pages are "dirty" (not yet saved to
disk), the results look very promising. I tried changing the cache size to
500,000 (x1024) for the same database (10M records, 100-byte record size);
adding from scratch was very fast until about 3M records (the task manager
showing about 800MB of memory used at that moment) and dropped significantly
after that. Whether this breaking point is at 3M records or not depends on
the size of the data used in the index. So as long as any search inside the
tree is a walk through cached pages (cached either by sqlite or by the
underlying file system), and any page writes involved in new tree operations
go to dirty pages that will be written later, there is no problem. Otherwise
there is. The other problem is that when the total data size exceeds the
32-bit address space, the cache stops giving a significant improvement.

I know the quicksort approach I suggested has disadvantages, but in some ways
its results are more predictable. For example, we postpone any sqlite write
operation until the data is ready. Unfortunately, for index data bigger than
the 32-bit address space the file mapping doesn't work, but it can be changed
to simple seek access, and at least the file system can do a decent job of
keeping the frequently accessed data in its cache.

Marcus, I didn't understand your comment about frequent reloading. As I read
the initial post, he has a large chunk of unsorted data, 20M records, that
needs to be processed on a daily basis. So in his case any approach that ends
up with the data placed inside an sqlite db and indexed properly is fine. The
problem is that, on a daily basis, a couple of hours is a big price to pay.
Jerome can correct me, but he hasn't added anything new to this discussion
yet; I hope he will.

Nice approach with the hash and the collision resolution inside the query;
I'll keep it in mind for future use )

Max

On Sun, Feb 14, 2010 at 4:03 PM, Marcus Grimm wrote:

> Just for my curiosity:
>
> Have you tried increasing the cache as already suggested?
>
> I ran into a similar problem while playing with an artificial
> test database with appx. 10 million records and creating an index.
> Without drastically increasing the cache size, sqlite appears
> unable to create an index on a certain field - it never finishes
> within, say, 2 hours.
> That sounds quite disappointing keeping in mind how fast sqlite usually
> operates, but it becomes clear when we consider that sqlite
> is told to use only a few MB of memory by default. Also, your
> quicksort mentioned below will be very slow if it needs to
> reload data from disk all the time.
> So in my case it helped to tell sqlite to use appx. 500MB of memory
> via pragma cache_size.
>
> Please note that if you create an index on a text field, sqlite
> will basically make a copy of the whole table in your case.
>
> Depending on how you use that Text field in a filter statement,
> you may consider adding an integer hash (e.g. CRC32) column to your
> table, creating an index on that, and slightly changing your queries
> like:
> SELECT * From TestTable WHERE TextHash=12312 AND Text='Text to search';
> Unfortunately that works only for this simple form of "=" statements.
>
> Marcus
>
> > Jerome,
> >
> > It's an interesting challenge, thanks for the post.
> > I tried to research it a bit and did some tests.
> > My test database contains a table with 10,000,000 records of text 100
> > chars in length:
> >
> > CREATE TABLE [TestTable] (
> > [Id] INTEGER PRIMARY KEY AUTOINCREMENT,
> > [Text] TEXT
> > )
> >
> > I suppose your data is different, but at least this one has more than 1M
> > records and not such a small record size. My final database was about 1G
> > in size.
> >
> > The default index creation
> >
> > CREATE INDEX [idx_TestTable] ON [TestTable] ([Text] )
> >
> > took a very long time (about two hours or so).
> > Having the index in place before the data insert didn't change anything;
> > the first records had a good append speed (about 10,000 records/sec),
> > slowing significantly once the number of records exceeded 1-2M.
> >
> > So there was no way to avoid some external-files approach, and I did it by
> > filling a memory-mapped file with the contents of the Text field, while
> > also filling an array in memory 

Re: [sqlite] Prohibitive indexing time with large tables

2010-02-14 Thread Marcus Grimm
Just for my curiosity:

Have you tried increasing the cache as already suggested?

I ran into a similar problem while playing with an artificial
test database with appx. 10 million records and creating an index.
Without drastically increasing the cache size, sqlite appears
unable to create an index on a certain field - it never finishes
within, say, 2 hours.
That sounds quite disappointing keeping in mind how fast sqlite usually
operates, but it becomes clear when we consider that sqlite
is told to use only a few MB of memory by default. Also, your
quicksort mentioned below will be very slow if it needs to
reload data from disk all the time.
So in my case it helped to tell sqlite to use appx. 500MB of memory
via pragma cache_size.

Please note that if you create an index on a text field, sqlite
will basically make a copy of the whole table in your case.

Depending on how you use that Text field in a filter statement,
you may consider adding an integer hash (e.g. CRC32) column to your
table, creating an index on that, and slightly changing your queries
like:
SELECT * From TestTable WHERE TextHash=12312 AND Text='Text to search';
Unfortunately that works only for this simple form of "=" statements.
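
A small Python sketch of that hash-column idea (zlib.crc32 for the hash; the
table layout follows the TestTable example used elsewhere in this thread):

import sqlite3
import zlib

def crc32(s):
    return zlib.crc32(s.encode("utf-8"))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TestTable (Id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " Text TEXT, TextHash INTEGER)")
con.executemany("INSERT INTO TestTable(Text, TextHash) VALUES (?, ?)",
                ((t, crc32(t)) for t in ("alpha", "beta", "Text to search")))

# Index the small integer hash instead of the long text column.
con.execute("CREATE INDEX idx_TextHash ON TestTable(TextHash)")

# The query filters on the hash (which uses the index) and re-checks the
# text to weed out hash collisions.
needle = "Text to search"
print(con.execute("SELECT * FROM TestTable WHERE TextHash = ? AND Text = ?",
                  (crc32(needle), needle)).fetchall())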

Marcus

> Jerome,
>
> It's an interesting challenge, thanks for the post.
> I tried to research it a bit and did some tests.
> My test database contains a table with 10,000,000 records of text 100
> chars in length:
>
> CREATE TABLE [TestTable] (
> [Id] INTEGER PRIMARY KEY AUTOINCREMENT,
> [Text] TEXT
> )
>
> I suppose your data is different, but at least this one has more than 1M
> records and not such a small record size. My final database was about 1G in
> size.
>
> The default index creation
>
> CREATE INDEX [idx_TestTable] ON [TestTable] ([Text] )
>
> took a very long time (about two hours or so).
> Having the index in place before the data insert didn't change anything;
> the first records had a good append speed (about 10,000 records/sec),
> slowing significantly once the number of records exceeded 1-2M.
>
> So there was no way to avoid some external-files approach, and I did it by
> filling a memory-mapped file with the contents of the Text field, while also
> filling an in-memory array with the offsets and lengths of the strings.
> After that, quicksorting that offset array took about 5 minutes and
> inserting the texts in sorted order into the sqlite database took another 5
> minutes, so about 10 minutes in total. The first 5 minutes were possible
> because we exchange only offsets, not data, and the other 5 minutes because
> inserting sorted data into a B-tree is really a fast operation.
>
> Although real-life data can be different, the things that worked might be
> the same. So anyone can use this method, occupying no more than the size of
> the sqlite file itself for temporary storage, and ending up with the data in
> the necessary order inside the sqlite database afterwards. I know that there
> are many things to take into account, like the memory size and the size of
> the actual data, but it's just a proof of concept.
>
> Also, I think sqlite could use the same approach internally when creating an
> index on existing data. The db is probably already exclusively locked while
> CREATE INDEX is in progress, so keeping a temporary array that accesses and
> stores, for example, the file offsets of particular records should not be a
> problem.
>
> Max
>
> On Sat, Feb 13, 2010 at 5:00 PM, Jérôme Magnin wrote:
>
>> Hi,
>>
>> This post is a question directed to D. Richard Hipp :
>>
>> I have been using SQLite for 3 years in a records linkage software
>> package I have developed. My organization recently had to use the
>> package to perform daily linkage of large administrative governmental
>> registers (up to 20 million records each). During the linkage process,
>> auxiliary tables containing records "fingerprints" must be created, and
>> two columns be indexed in them.
>>
>> SQLite provides decent indexing times for such tables with up to 1M
>> rows, but beyond this size the (already well-discussed) critical slowing
>> down of indexing performance due to disk nonlocality kicks in. The only
>> workaround I could imagine to ease the problem would be to duplicate the
>> auxiliary table and load pre-sorted rows in it, with sort key being the
>> column I intend to index on. This is unfortunately too costly in terms
>> of disk space used.
>>
>> I therefore had to develop an alternate datasource type (based on flat
>> files) in order for my system to be able to efficiently handle big
>> files. Which is a pity since SQLite provides great features I still
>> would like to be able to rely upon when dealing with large files.
>>
>> Now my question: in the "To do" section of SQLite's wiki, you mention
>> "Develop a new sort implementation that does much less disk seeking. Use
>> to improve indexing performance on large tables.". I have been seeing
>> this entry for 3 years but nothing concrete seems to have happened on
>> this front in the meantime. Do you have any idea about if (and when) you
>> will work on 

Re: [sqlite] Prohibitive indexing time with large tables

2010-02-14 Thread Max Vlasov
Jerome,

It's an interesting challenge, thanks for the post.
I tried to research it a bit and did some tests.
My test database contains a table with 10,000,000 records of text 100
chars in length:

CREATE TABLE [TestTable] (
[Id] INTEGER PRIMARY KEY AUTOINCREMENT,
[Text] TEXT
)

I suppose your data is different, but at least this one has more than 1M
records and not such a small record size. My final database was about 1G in
size.

The default index creation

CREATE INDEX [idx_TestTable] ON [TestTable] ([Text] )

took a very long time (about two hours or so).
Having the index in place before the data insert didn't change anything; the
first records had a good append speed (about 10,000 records/sec), slowing
significantly once the number of records exceeded 1-2M.

So there was no way to avoid some external-files approach, and I did it by
filling a memory-mapped file with the contents of the Text field, while also
filling an in-memory array with the offsets and lengths of the strings. After
that, quicksorting that offset array took about 5 minutes and inserting the
texts in sorted order into the sqlite database took another 5 minutes, so
about 10 minutes in total. The first 5 minutes were possible because we
exchange only offsets, not data, and the other 5 minutes because inserting
sorted data into a B-tree is really a fast operation.
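
A much-simplified Python sketch of the same idea: sort outside sqlite, then
insert in already-sorted order. Here the texts are simply sorted in RAM,
which only demonstrates the ordering trick, not the memory-mapped file and
offset array (file names are assumptions):

import sqlite3

src = sqlite3.connect("test.db")      # source table as in the example above
dst = sqlite3.connect("sorted.db")
dst.execute("CREATE TABLE TestTable"
            " (Id INTEGER PRIMARY KEY AUTOINCREMENT, Text TEXT)")

# Step 1: pull the keys out and sort them outside sqlite.
texts = sorted(t for (t,) in src.execute("SELECT Text FROM TestTable"))

# Step 2: insert rows in key order; because the keys now arrive pre-sorted,
# building the index afterwards appends to the B-tree almost sequentially.
with dst:
    dst.executemany("INSERT INTO TestTable (Text) VALUES (?)",
                    ((t,) for t in texts))
    dst.execute("CREATE INDEX idx_TestTable ON TestTable (Text)")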

Although real-life data can be different, the things that worked might be the
same. So anyone can use this method, occupying no more than the size of the
sqlite file itself for temporary storage, and ending up with the data in the
necessary order inside the sqlite database afterwards. I know that there are
many things to take into account, like the memory size and the size of the
actual data, but it's just a proof of concept.

Also, I think sqlite could use the same approach internally when creating an
index on existing data. The db is probably already exclusively locked while
CREATE INDEX is in progress, so keeping a temporary array that accesses and
stores, for example, the file offsets of particular records should not be a
problem.

Max

On Sat, Feb 13, 2010 at 5:00 PM, Jérôme Magnin wrote:

> Hi,
>
> This post is a question directed to D. Richard Hipp :
>
> I have been using SQLite for 3 years in a records linkage software
> package I have developed. My organization recently had to use the
> package to perform daily linkage of large administrative governmental
> registers (up to 20 million records each). During the linkage process,
> auxiliary tables containing records "fingerprints" must be created, and
> two columns be indexed in them.
>
> SQLite provides decent indexing times for such tables with up to 1M
> rows, but beyond this size the (already well-discussed) critical slowing
> down of indexing performance due to disk nonlocality kicks in. The only
> workaround I could imagine to ease the problem would be to duplicate the
> auxiliary table and load pre-sorted rows in it, with sort key being the
> column I intend to index on. This is unfortunately too costly in terms
> of disk space used.
>
> I therefore had to develop an alternate datasource type (based on flat
> files) in order for my system to be able to efficiently handle big
> files. Which is a pity since SQLite provides great features I still
> would like to be able to rely upon when dealing with large files.
>
> Now my question: in the "To do" section of SQLite's wiki, you mention
> "Develop a new sort implementation that does much less disk seeking. Use
> to improve indexing performance on large tables.". I have been seeing
> this entry for 3 years but nothing concrete seems to have happened on
> this front in the meantime. Do you have any idea about if (and when) you
> will work on this in the future ? Can I nourish reasonable hopes that
> the situation will improve on this aspect within the next 2 years ? This
> really has become a critical factor for me to decide on my future
> development strategy with this product.
>
> Thanks in advance for any useful information.
>
> Jerome
>
>


Re: [sqlite] Prohibitive indexing time with large tables

2010-02-13 Thread Jay A. Kreibich
On Sat, Feb 13, 2010 at 11:21:25AM -0800, Roger Binns scratched on the wall:
> Jérôme Magnin wrote:
> > SQLite provides decent indexing times for such tables with up to 1M 
> > rows, 
> 
> As another data point, my data set is 10M records and the 6 indices I need
> are created in under two minutes.  (I also analyze.)  The database is 2GB
> but the least amount of memory I have is 4GB.  I also use a 32kb page size,
> larger cache etc.

  A larger cache makes a huge difference with index creation.  As the
  website indicates, there are some known issues with the index sorting
  algorithm that, to my knowledge, have never bubbled to the top of the
  development team's TODO list.  The existing insert algorithm does its
  job, but tends to need access to a large number of pages which can cause
  major cache thrashing if the index pages overflow the cache size.

  If you haven't already, try increasing the default cache size by 100x or
  higher.  Just be sure you stay within physical memory.

> > I therefore had to develop an alternate datasource type (based on flat 
> > files) in order for my system to be able to efficiently handle big 
> > files. Which is a pity since SQLite provides great features I still 
> > would like to be able to rely upon when dealing with large files.
> 
> You can also make virtual tables which will let you have the appearance of
> everything being in SQLite.

  Yes... Virtual Tables will end up doing flat table-scans in most cases
  (unless you get very complex with the code) but are often extremely fast,
  especially if you're able to mmap the file.  I've used them with
  great success for many "read-only" type tables where the typical
  access pattern is a scan and aggregation.  Log files, for example.

  Although, if you're talking about creating multiple indexes, I assume
  that isn't going to work in this situation.

  Virtual Tables that link to external data sources are really cool
  and hugely useful, however.  I'm surprised they aren't used more.
  If nothing else, they make great data importers.

   -j

-- 
Jay A. Kreibich < J A Y  @  K R E I B I.C H >

"Our opponent is an alien starship packed with atomic bombs.  We have
 a protractor."   "I'll go home and see if I can scrounge up a ruler
 and a piece of string."  --from Anathem by Neal Stephenson


Re: [sqlite] Prohibitive indexing time with large tables

2010-02-13 Thread Roger Binns

Jérôme Magnin wrote:
> SQLite provides decent indexing times for such tables with up to 1M 
> rows, 

As another data point, my data set is 10M records and the 6 indices I need
are created in under two minutes.  (I also analyze.)  The database is 2GB
but the least amount of memory I have is 4GB.  I also use a 32kb page size,
larger cache etc.

> The only 
> workaround I could imagine to ease the problem would be to duplicate the 
> auxiliary table and load pre-sorted rows in it, with sort key being the 
> column I intend to index on. This is unfortunately too costly in terms 
> of disk space used.

What numbers are you getting? One man's unacceptable is another's ok.
Surely the index generation is a one-time thing done at upgrade time.
Couldn't that be done late at night, so that taking several hours would be ok?

> I therefore had to develop an alternate datasource type (based on flat 
> files) in order for my system to be able to efficiently handle big 
> files. Which is a pity since SQLite provides great features I still 
> would like to be able to rely upon when dealing with large files.

You can also make virtual tables which will let you have the appearance of
everything being in SQLite.

Note that SQLite is unlikely to beat a custom storage format designed
specifically for a particular data set, since SQLite is general-purpose code.

> Can I nourish reasonable hopes that the situation will improve on this aspect
> within the next 2 years ? This
> really has become a critical factor for me to decide on my future 
> development strategy with this product.

Unless you post a representative schema and data set somewhere, no one else
can make it better for what you are doing.  Alternatively see the first
section of http://sqlite.org/support.html

Roger