Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Joe Wilson
--- [EMAIL PROTECTED] wrote:
> Joe Wilson <[EMAIL PROTECTED]> wrote:
> > 
> > As for the stats from sqlite3_analyzer, they seem to be in the right 
> > ballpark.
> > But I'm not sure its heuristic accounts for rows that are significantly 
> > larger
> > than the page size, though. In such cases I am seeing higher than expected 
> > fragmentation after a VACUUM. This is just a guess, of course.
> 
> I'm not sure sqlite3_analyzer does any of the fragmentation
> measurement right.  For that matter, how do you measure
> fragmentation with a number?  (Suggestions are welcomed.)
> 
> I may yet just yank that whole fragmentation measurement
> idea.

With non-volatile RAM drives getting larger and cheaper by the day, 
you may not need to worry about fragmentation, fsync and disk-seek time 
in a year or two.



 


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread drh
Joe Wilson <[EMAIL PROTECTED]> wrote:
> 
> As for the stats from sqlite3_analyzer, they seem to be in the right ballpark.
> But I'm not sure its heuristic accounts for rows that are significantly larger
> than the page size, though. In such cases I am seeing higher than expected 
> fragmentation after a VACUUM. This is just a guess, of course.
> 

I'm not sure sqlite3_analyzer does any of the fragmentation
measurement right.  For that matter, how do you measure
fragmentation with a number?  (Suggestions are welcomed.)

I may yet just yank that whole fragmentation measurement
idea.

--
D. Richard Hipp  <[EMAIL PROTECTED]>


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Joe Wilson
--- [EMAIL PROTECTED] wrote:
> Joe Wilson <[EMAIL PROTECTED]> wrote:
> > 
> > See also: Changes to support fragmentation analysis in sqlite3_analyzer.
> > http://www.sqlite.org/cvstrac/chngview?cn=3634
> > 
> 
> I'm not real sure those patches are working right.
> I need to revisit that whole fragmentation analysis
> thing before the next release.  Somebody please
> remind me

What do you suspect is not working right?

I've run the new CVS VACUUM on large databases without any apparent ill effects,
and it's noticeably faster than the previous VACUUM implementation.

As for the stats from sqlite3_analyzer, they seem to be in the right ballpark.
But I'm not sure its heuristic accounts for rows that are significantly larger
than the page size, though. In such cases I am seeing higher than expected 
fragmentation after a VACUUM. This is just a guess, of course.


 


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread drh
Joe Wilson <[EMAIL PROTECTED]> wrote:
> 
> See also: Changes to support fragmentation analysis in sqlite3_analyzer.
> http://www.sqlite.org/cvstrac/chngview?cn=3634
> 

I'm not real sure those patches are working right.
I need to revisit that whole fragmentation analysis
thing before the next release.  Somebody please
remind me.
--
D. Richard Hipp  <[EMAIL PROTECTED]>


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Martin Jenkins

P Kishor wrote:
Mac/Unix person meself, but the Windows XP sort is pretty darn good as 
well.


I'll take a look. Last time I used it, it was useless (Win9x days?). These 
days (especially for a one-off) I'd probably go straight to doing it in 
Python to avoid cross-platform syntax issues.


Martin

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Chris Jones

Ah yes, I should read more carefully :)  

Thanks, right, I was actually guaranteeing uniqueness originally by just
fetching and then inserting only if there wasn't a match (I needed a rowid
if the row existed anyway).  Now I'm guaranteeing uniqueness by letting sort
do the work for me, but similarly to my last response, it probably makes
sense for me to add it to my schema as a sanity check if nothing else.

Thanks,
Chris



Derrell.Lipman wrote:
> 
> Chris Jones <[EMAIL PROTECTED]> writes:
> 
>> Derrell.Lipman wrote:
> 
> 
> So to guarantee that the *strings* are unique, you need a UNIQUE index on
> the
> string field.  The ROWID is the INTEGER PRIMARY KEY whether you specify a
> different name for it or not, but that will not guarantee that each of
> your
> *strings* is unique.  Only a UNIQUE index on the string field will do
> that.
> 
> As long as you understand this, no need to reply.
> 
> Cheers,
> 
> Derrell
> 
> -
> To unsubscribe, send email to [EMAIL PROTECTED]
> -
> 
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Index-creation-on-huge-table-will-never-finish.-tf3444218.html#a9626741
Sent from the SQLite mailing list archive at Nabble.com.


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Joe Wilson
--- [EMAIL PROTECTED] wrote:
> Gerry Snyder <[EMAIL PROTECTED]> wrote:
> > Chris Jones wrote:
> > > Hi all,
> > > 
> > > I have a very simple schema.  I need to assign a unique identifier to a
> > > large collection of strings, each at most 80-bytes, although typically
> > > shorter.
> > > 
> > > The problem is I have 112 million of them.
> > 
> > Maybe you could start by breaking the data into 8 equal groups and make 
> > a table of each group. Then merge the original groups pairwise, then 
> > merge those 4 groups, and finally the 2 semifinal groups (kinda like 
> > March Madness, come to think of it). Since each merging will be of 
> > already sorted/indexed data, it might save a lot of time.
> > 
> 
> This is the right idea.
> 
> The problem is that your working set is bigger than your cache
> which is causing thrashing.  I suggest a solution like this:
> 
> Add entries to table ONE until the table and its unique index get
> so big that they no longer fit in cache all at once.  Then
> transfer ONE into TWO like this:
> 
>INSERT INTO two SELECT * FROM one ORDER BY unique_column;
> 
> The ORDER BY is important here.
> 
> Do the above a time or two until TWO is significantly larger than ONE.
> Then do the same into TWO_B.  Later combine TWO and TWO_B into THREE:
> 
>INSERT INTO three SELECT * FROM two ORDER BY unique_column;
>INSERT INTO three SELECT * FROM two_b ORDER BY unique_column;
> 
> Repeat as necessary for FOUR, FIVE, SIX and so forth.

I've tried something like the algorithm you've proposed and it's much
slower than pre-sorting all the data prior to bulk insert.
It may have something to do with the cost of multiple repeated inserts 
for each original row.

Here's a different attempt at speeding up bulk insert following your
suggestion from Ticket 2075:

 http://www.mail-archive.com/sqlite-users%40sqlite.org/msg22143.html

Any suggestions to speed it up are welcome.
Sample insert speed test included.


 


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Gerry Snyder

Chris Jones wrote:
> I probably should have made this more explicit, but in sqlite, every row has
> a unique identifier named rowid, which exists even if it isn't explicitly
> declared in the schema, and I was depending on that.  If you declare a
> PRIMARY KEY, then this replaces rowid.


A tiny correction: a column will not replace rowid unless it is
INTEGER PRIMARY KEY.

From the web page:

Specifying a PRIMARY KEY normally just creates a UNIQUE index on the 
corresponding columns. However, if primary key is on a single column 
that has datatype INTEGER, then that column is used internally as the 
actual key of the B-Tree for the table.


Gerry

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Derrell . Lipman
Chris Jones <[EMAIL PROTECTED]> writes:

> Derrell.Lipman wrote:
>> 
>> Chris Jones <[EMAIL PROTECTED]> writes:
>> 
>> I don't think that your original solution solves that problem either.  You
>> first posted this schema:
>> 
>>> My schema looks as follows:
>>> 
>>> CREATE TABLE rawfen ( fen VARCHAR(80) );
>>> CREATE INDEX rawfen_idx_fen ON rawfen(fen);
>> 
>> but you didn't indicate that rawfen_idx_fen is a UNIQUE INDEX so it won't
>> complain if there are duplicate strings.  To accomplish this (but not
>> solving
>> your timing issue), you need this:
>> 
>>   CREATE TABLE rawfen ( fen VARCHAR(80) );
>>   CREATE UNIQUE INDEX rawfen_idx_fen ON rawfen(fen);
>> 
>
> I probably should have made this more explicit, but in sqlite, every row has
> a unique identifier named rowid, which exists even if it isn't explicitly
> declared in the schema, and I was depending on that.   If you declare a
> PRIMARY KEY, then this replaces rowid.
>
> Of course, it's probably better practice to explicitly declare the primary
> key, but anyway, that's where I was going with it.

Those are two separate issues.  Your declared problem was:

> I don't think that solves my problem.  Sure, it guarantees that the IDs are
> unique, but not the strings.  

So to guarantee that the *strings* are unique, you need a UNIQUE index on the
string field.  The ROWID is the INTEGER PRIMARY KEY whether you specify a
different name for it or not, but that will not guarantee that each of your
*strings* is unique.  Only a UNIQUE index on the string field will do that.

As long as you understand this, no need to reply.

Cheers,

Derrell

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Chris Jones



Derrell.Lipman wrote:
> 
> Chris Jones <[EMAIL PROTECTED]> writes:
> 
> I don't think that your original solution solves that problem either.  You
> first posted this schema:
> 
>> My schema looks as follows:
>> 
>> CREATE TABLE rawfen ( fen VARCHAR(80) );
>> CREATE INDEX rawfen_idx_fen ON rawfen(fen);
> 
> but you didn't indicate that rawfen_idx_fen is a UNIQUE INDEX so it won't
> complain if there are duplicate strings.  To accomplish this (but not
> solving
> your timing issue), you need this:
> 
>   CREATE TABLE rawfen ( fen VARCHAR(80) );
>   CREATE UNIQUE INDEX rawfen_idx_fen ON rawfen(fen);
> 

I probably should have made this more explicit, but in sqlite, every row has
a unique identifier named rowid, which exists even if it isn't explicitly
declared in the schema, and I was depending on that.   If you declare a
PRIMARY KEY, then this replaces rowid.

Of course, it's probably better practice to explicitly declare the primary
key, but anyway, that's where I was going with it.

Thanks,
Chris

-- 
View this message in context: 
http://www.nabble.com/Index-creation-on-huge-table-will-never-finish.-tf3444218.html#a9626321
Sent from the SQLite mailing list archive at Nabble.com.


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Dennis Cote

Chris Jones wrote:

So, I did a "sort -u -S 1800M fenout.txt > fenoutsort.txt"

The sort took about 45 minutes, which is acceptable for me (it was much
longer without the -S option to tell it to make use of more memory), and
then loading the table was very efficient.  Inserting all the rows into my
table in sorted order took only 18 minutes.  



  
The sort command is more sophisticated than its usage info would lead 
one (at least me) to believe.


After I saw your post I checked the sort source and found that it does 
all the splitting, sorting, and merging that I had suggested internally 
when given a file bigger than its allowed buffer space. No need for any 
scripts at all.


I'm glad you found a suitable solution.

Dennis Cote



-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Joe Wilson
--- Dennis Cote <[EMAIL PROTECTED]> wrote:
> You could also improve the locality in the database file further by 
> running a vacuum command after it has been created. this will move the 
> pages around so that the page of the table are contiguous and so are the 
> pages of the index, rather than having them interspersed with each other.

In SQLite 3.3.13 and earlier, VACUUM does indeed intermix the table pages 
with the index pages leading to fragmentation. It's also quite slow on large 
tables with multiple indexes unless you use a large cache.

This patch in CVS addresses these problems:
http://www.sqlite.org/cvstrac/chngview?cn=3643

See also: Changes to support fragmentation analysis in sqlite3_analyzer.
http://www.sqlite.org/cvstrac/chngview?cn=3634



 


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread John Stanton
A fast technique to achieve your objective is to perform what I believe 
is called a "monkey puzzle" sort.  The data is not moved; instead, an 
array of descriptors to each element is sorted.  The output is realized 
by scanning the list of descriptors and picking up the associated record 
from the input list.


On a modest machine your application should run in less than ten 
minutes using that method.  One way we use it is as a first stage in 
building a B-Tree index rapidly.


Chris Jones wrote:

Thanks everyone for your feedback.

I ended up doing a presort on the data, and then adding the data in order.  
At first I was a little concerned about how I was going to implement an
external sort on a data set that huge, and realized that the unix "sort"
command can handle large files, and in fact does it pretty efficiently.

So, I did a "sort -u -S 1800M fenout.txt > fenoutsort.txt"

The sort took about 45 minutes, which is acceptable for me (it was much
longer without the -S option to tell it to make use of more memory), and
then loading the table was very efficient.  Inserting all the rows into my
table in sorted order took only 18 minutes.  


So, all in all, I can now load the table in just about an hour, which is
great news for me.

Thanks!
Chris




-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread John Stanton
An issue with cache is cache shadowing, the churning as data is copied 
from one cache to another to another.


An example is the speed-up achieved on network accesses by using 
sendfile or TransmitFile and bypassing up to four levels of buffering 
for a message being despatched to a network interface using send or similar.


Another is opening a file using open and writing to it with write and 
reading with read when you are transferring buffers full of data to or 
from a file.  You avoid the extra level of buffer copying inherent in 
fopen.  If you dispense with the reads and writes and access the virtual 
memory directly you get a further win.


A modern OS uses main memory as a file system cache by virtue of its 
virtual memory capability.  If you take advantage of that your 
applications will run faster.


P Kishor wrote:

On 3/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:


"P Kishor" <[EMAIL PROTECTED]> wrote:
> Richard,
>
> On 3/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> ...
> > The problem is that your working set is bigger than your cache
> > which is causing thrashing.  I suggest a solution like this:
> >
> > Add entries to table ONE until the table and its unique index get
> > so big that they no longer fit in cache all at once.
>
> What is this "cache" you talk about? Is this the hard disk cache, or
> some kind of OS-level cache, or a SQLite level cache?

All of the above.

> How can I find
> the size of this, and is this cache user-adjustable? If yes, how?
>

On Linux, the "top" command will show you how much OS disk
cache you are currently using.  The hard disk cache is usually
small enough that it can be neglected.  The OS cache is also
usually much larger than SQLite's own internal cache.

You might run an experiment where you start inserting and
timing each insert, then switch to a different table when
the insert speed drops below a threshold.




ok. I was hoping for something magical like 'showme_cache_size'. This
sounds more voodoo-ish, but do-able. In any case, it doesn't seem that
this OS cache is user-adjustable, at least not without pipe wrenches
and WD-40.

On a note more related to the OP, even if one could adjust the cache
to the working set (and that would be a moving target -- either make
the cache gigantic, or keep on querying the relative sizes of the two
and somehow keep on adjusting the cache), the time taken to determine
uniqueness in a bigger working set will keep on increasing as the
working set itself keeps on increasing... the curve won't be flat.




-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread P Kishor

On 3/22/07, Martin Jenkins <[EMAIL PROTECTED]> wrote:
> Chris Jones wrote:
> > realized that the unix "sort"
>
> If I'd known you were on 'nix I'd have suggested using 'sort' and/or
> 'md5sum' about 12 hours ago. ;)

Mac/Unix person meself, but the Windows XP sort is pretty darn good as well.


--
Puneet Kishor http://punkish.eidesis.org/
Nelson Inst. for Env. Studies, UW-Madison http://www.nelson.wisc.edu/
Open Source Geospatial Foundation http://www.osgeo.org/education/
-
collaborate, communicate, compete
=

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Martin Jenkins

Chris Jones wrote:

realized that the unix "sort"


If I'd known you were on 'nix I'd have suggested using 'sort' and/or 
'md5sum' about 12 hours ago. ;)


Martin

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread John Stanton

You could sort the table then perform a merge which removes duplicates.

Chris Jones wrote:

I don't think that solves my problem.  Sure, it guarantees that the IDs are
unique, but not the strings.  


My whole goal is to be able to create a unique identifier for each string,
in such a way that I don't have the same string listed twice, with different
identifiers.

In your solution, there is no way to lookup a string to see if it already
exists, since there is no index on the string.

Thanks,
Chris



-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Chris Jones

Thanks everyone for your feedback.

I ended up doing a presort on the data, and then adding the data in order.  
At first I was a little concerned about how I was going to implement an
external sort on a data set that huge, and realized that the unix "sort"
command can handle large files, and in fact does it pretty efficiently.

So, I did a "sort -u -S 1800M fenout.txt > fenoutsort.txt"

The sort took about 45 minutes, which is acceptable for me (it was much
longer without the -S option to tell it to make use of more memory), and
then loading the table was very efficient.  Inserting all the rows into my
table in sorted order took only 18 minutes.  

So, all in all, I can now load the table in just about an hour, which is
great news for me.

Thanks!
Chris
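
A rough sketch of that whole pipeline with the sqlite3 shell, for anyone
repeating it (fen.db is a placeholder name, the .import route is only one
possible way to do the final load, and it assumes none of the strings contain
the shell's default '|' column separator):

   # Presort and de-duplicate outside the database, then bulk load in order.
   sort -u -S 1800M fenout.txt > fenoutsort.txt
   sqlite3 fen.db 'CREATE TABLE rawfen ( fen VARCHAR(80) );'
   sqlite3 fen.db '.import fenoutsort.txt rawfen'
   sqlite3 fen.db 'CREATE UNIQUE INDEX rawfen_idx_fen ON rawfen(fen);'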

-- 
View this message in context: 
http://www.nabble.com/Index-creation-on-huge-table-will-never-finish.-tf3444218.html#a9618709
Sent from the SQLite mailing list archive at Nabble.com.


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Dennis Cote

Chris Jones wrote:

I've read elsewhere that this is a data locality issue, which certainly
makes sense.

And in those threads, a suggestion has been made to insert in sorted order. 
But it's unclear to me exactly what the sorting function would need to be -

it's likely my sorting function (say strcmp()) wouldn't match what sqlite
expects.  It's also a little unclear if I can even sort this many keys
externally with any ease.  After all, that's why I'm trying to use a
database.

  

Chris,

You are correct that this is a locality of reference issue. Using a 
presorted list speeds insertions because all the index insertions are 
made at the end of the index. To make this run faster you need to prepare a 
presorted list. The question is how best to do that with a large file.


You didn't say what OS you are running, but these days most basic unix 
commands are available on Windows as well. I am assuming you have your 
list of 112M lines in a text file. You can prepare a large sorted file 
pretty efficiently using a merge sort, since it reads the input files in 
one pass. The trick is the input files must be presorted. So you need to 
split your original data file into many smaller files, sort each in 
memory (where locality isn't a problem as long as the data fits in real 
memory), and then merge the individual sorted files. This is best done 
in a script.


You can use the split command to split your text file into many separate 
files. If you split your file into chunks of 1M lines you will have 112 
files to sort and merge. Each file will be less than 80 MB, so it should 
be sortable in memory.


   split -l 1000000 infile part

This will generate 112 files named partaa, partab, partac ...

Now you need to sort each of these files. Note the sort command can also 
eliminate any duplicate lines in your files (the -u for unique option).


   parts=`ls part*`
   for f in $parts
   do
   sort -o $f -u $f
   done

Now you have 112 sorted files that need to be merged.

   mv partaa sorted
   parts=`ls part*`
   for f in $parts
   do
   sort -m -u -o sorted sorted $f
   done   

You now have a single large sorted file that contains unique strings. 
You should be able to insert the strings from this file into your sqlite 
database much faster than you could before.


If you want you can get fancier with the script and merge pairs of 
sorted files into increasingly larger files until you have a single 
file. I'll leave that as an exercise for the reader.


You could also improve the locality in the database file further by 
running a vacuum command after it has been created. This will move the 
pages around so that the pages of the table are contiguous and so are the 
pages of the index, rather than having them interspersed with each other.


HTH
Dennis Cote



A shell script that takes your raw text file as an argument.

#!/bin/sh
# Split the raw text file into chunks of 1M lines: partaa, partab, ...
split -l 1000000 $1 part

# Sort each chunk in memory, dropping duplicates within the chunk (-u).
parts=`ls part*`
for f in $parts
do
   sort -o $f -u $f
done

# Merge the sorted chunks into one file, again dropping duplicates (-u)
# so that strings repeated across different chunks are removed too.
mv partaa sorted
parts=`ls part*`
for f in $parts
do
   sort -m -u -o sorted sorted $f
done


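A rough take on the pairwise-merge variant left as an exercise above, as a
sketch only (it reuses the part* chunk names, adds -u so duplicates that span
chunks are dropped as well, and assumes a sort that accepts -m together with
-u, as GNU sort does):

#!/bin/sh
# Pairwise (tournament) merge of the sorted part* chunks into a single file.
round=0
while [ `ls part* | wc -l` -gt 1 ]
do
   round=`expr $round + 1`
   n=0
   set -- part*
   while [ $# -ge 2 ]
   do
      n=`expr $n + 1`
      # -m merges two already-sorted files; -u drops duplicate lines.
      sort -m -u -o part.round$round.$n "$1" "$2"
      rm "$1" "$2"
      shift 2
   done
   # A chunk without a partner this round is simply left for the next round.
done
mv part* sorted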



-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread P Kishor

On 3/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:

"P Kishor" <[EMAIL PROTECTED]> wrote:
> Richard,
>
> On 3/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> ...
> > The problem is that your working set is bigger than your cache
> > which is causing thrashing.  I suggest a solution like this:
> >
> > Add entries to table ONE until the table and its unique index get
> > so big that they no longer fit in cache all at once.
>
> What is this "cache" you talk about? Is this the hard disk cache, or
> some kind of OS-level cache, or a SQLite level cache?

All of the above.

> How can I find
> the size of this, and is this cache user-adjustable? If yes, how?
>

On Linux, the "top" command will show you how much OS disk
cache you are currently using.  The hard disk cache is usually
small enough that it can be neglected.  The OS cache is also
usually much larger than SQLite's own internal cache.

You might run an experiment where you start inserting and
timing each insert, then switch to a different table when
the insert speed drops below a threshold.



ok. I was hoping for something magical like 'showme_cache_size'. This
sounds more voodoo-ish, but do-able. In any case, it doesn't seem that
this OS cache is user-adjustable, at least not without pipe wrenches
and WD-40.

On a note more related to the OP, even if one could adjust the cache
to the working set (and that would be a moving target -- either make
the cache gigantic, or keep on querying the relative sizes of the two
and somehow keep on adjusting the cache), the time taken to determine
uniqueness in a bigger working set will keep on increasing as the
working set itself keeps on increasing... the curve won't be flat.

--
Puneet Kishor http://punkish.eidesis.org/
Nelson Inst. for Env. Studies, UW-Madison http://www.nelson.wisc.edu/
Open Source Geospatial Foundation http://www.osgeo.org/education/
-
collaborate, communicate, compete
=

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread drh
"P Kishor" <[EMAIL PROTECTED]> wrote:
> Richard,
> 
> On 3/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
> ...
> > The problem is that your working set is bigger than your cache
> > which is causing thrashing.  I suggest a solution like this:
> >
> > Add entries to table ONE until the table and its unique index get
> > so big that they no longer fit in cache all at once.
> 
> What is this "cache" you talk about? Is this the hard disk cache, or
> some kind of OS-level cache, or a SQLite level cache? 

All of the above.

> How can I find
> the size of this, and is this cache user-adjustable? If yes, how?
> 

On Linux, the "top" command will show you how much OS disk
cache you are currently using.  The hard disk cache is usually
small enough that it can be neglected.  The OS cache is also
usually much larger than SQLite's own internal cache.

You might run an experiment where you start inserting and
timing each insert, then switch to a different table when
the insert speed drops below a threshold.
--
D. Richard Hipp  <[EMAIL PROTECTED]>
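
For the SQLite piece of that cache, the page cache is adjustable per connection
with a pragma; a small sketch with the sqlite3 shell (the 100000-page figure
and the test.db name are just examples, and rawfen is the table from earlier
in the thread):

   # SQLite's own page cache defaults to 2000 pages per connection; raising it
   # keeps more of the table and index in memory while the index is built.
   sqlite3 test.db 'PRAGMA cache_size = 100000;
                    CREATE INDEX rawfen_idx_fen ON rawfen(fen);'

   # Or store a bigger default in the database file itself (units are pages).
   sqlite3 test.db 'PRAGMA default_cache_size = 100000;'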


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Eduardo Morras

At 04:47 22/03/2007, you wrote:


I don't think that solves my problem.  Sure, it guarantees that the IDs are
unique, but not the strings.

My whole goal is to be able to create a unique identifier for each string,
in such a way that I don't have the same string listed twice, with different
identifiers.

In your solution, there is no way to lookup a string to see if it already
exists, since there is no index on the string.

Thanks,
Chris


So you have a file with data: a large collection of strings, 112 million 
of them, each at most 80 bytes, although typically shorter.

How do you manage repeated data? Replace? First In? Modify string to be unique?

You want to put them in a sqlite3 database, but each string must appear only 
once. The problem I see here is whether your data file has repeated 
strings or not. I think that grep or a perl script can help you a 
lot in cleaning your data first. Then the import into the database will be fast.


HTH




--
"Hemos encontrado al enemigo y somos nosotros"


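A couple of one-liners along the lines of that cleanup step, sketched with
standard Unix tools rather than perl (fenout.txt and fenoutsort.txt are the
file names used elsewhere in this thread):

   # Show a few of the strings that occur more than once.
   sort fenout.txt | uniq -d | head

   # Write a de-duplicated copy that is ready for import.
   sort -u fenout.txt > fenoutsort.txt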

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread P Kishor

Richard,

On 3/22/07, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
..

The problem is that your working set is bigger than your cache
which is causing thrashing.  I suggest a solution like this:

Add entries to table ONE until the table and its unique index get
so big that they no longer fit in cache all at once.


What is this "cache" you talk about? Is this the hard disk cache, or
some kind of OS-level cache, or a SQLite level cache? How can I find
the size of this, and is this cache user-adjustable? If yes, how?

--
Puneet Kishor http://punkish.eidesis.org/
Nelson Inst. for Env. Studies, UW-Madison http://www.nelson.wisc.edu/
Open Source Geospatial Foundation http://www.osgeo.org/education/
-
collaborate, communicate, compete
=

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Brad Stiles
Gerry Snyder <[EMAIL PROTECTED]> wrote:

> Chris Jones wrote:
> Hi all,

> I have a very simple schema.  I need to assign a unique identifier
> to a large collection of strings, each at most 80-bytes, although
> typically shorter.

Would it help to hash the strings, then save them in the DB, checking the hash 
instead of the string for duplication?  You might still get duplicates, 
depending on what hashing algorithm you use, but those should be relatively 
easy to find afterwards.  Hashing could take a while, but that should be a 
linear time operation all by itself.
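
A rough sketch of that idea using md5sum, keeping the original string next to
its hash so collisions can still be checked afterwards (hashed.txt is a made-up
name, the loop assumes the strings contain no '|' characters, and spawning one
md5sum per line is far too slow for 112 million rows; it is shown only for the
shape of the approach):

   # Turn each line into "md5-hash|original-string".
   while IFS= read -r line
   do
      h=`printf '%s' "$line" | md5sum | cut -d' ' -f1`
      printf '%s|%s\n' "$h" "$line"
   done < fenout.txt > hashed.txt

   # Any hash that appears more than once marks a duplicate (or a collision).
   cut -d'|' -f1 hashed.txt | sort | uniq -d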



-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread drh
Ion Silvestru <[EMAIL PROTECTED]> wrote:
> > drh wrote:
> >   INSERT INTO two SELECT * FROM one ORDER BY unique_column;
> 
> >The ORDER BY is important here.
> 
> This is an excerpt from SQLite documentation:
> 
> The second form of the INSERT statement takes its data from a SELECT statement.
> The number of columns in the result of the SELECT must exactly match the 
> number
> of columns in the table if no column list is specified,
> or it must match the number of columns named in the column list.
> A new entry is made in the table for every row of the SELECT result.
> The SELECT may be simple or compound.
> If the SELECT statement has an ORDER BY clause, the ORDER BY is ignored.
>  ^
> 
> Question: ORDER BY is ignored or not ?
> 

You are looking at obsolete documentation.  See

  http://www.sqlite.org/cvstrac/tktview?tn=1923
  http://www.sqlite.org/cvstrac/chngview?cn=3356

--
D. Richard Hipp  <[EMAIL PROTECTED]>


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Derrell . Lipman
Chris Jones <[EMAIL PROTECTED]> writes:

> I don't think that solves my problem.  Sure, it guarantees that the IDs are
> unique, but not the strings.  
>
> My whole goal is to be able to create a unique identifier for each string,
> in such a way that I don't have the same string listed twice, with different
> identifiers.

I don't think that your original solution solves that problem either.  You
first posted this schema:

> My schema looks as follows:
> 
> CREATE TABLE rawfen ( fen VARCHAR(80) );
> CREATE INDEX rawfen_idx_fen ON rawfen(fen);

but you didn't indicate that rawfen_idx_fen is a UNIQUE INDEX so it won't
complain if there are duplicate strings.  To accomplish this (but not solving
your timing issue), you need this:

  CREATE TABLE rawfen ( fen VARCHAR(80) );
  CREATE UNIQUE INDEX rawfen_idx_fen ON rawfen(fen);

or, more concisely,

  CREATE TABLE rawfen ( fen VARCHAR(80) UNIQUE);

As previously stated, you can guarantee a unique id for each string with

  CREATE TABLE rawfen (id INTEGER PRIMARY KEY, fen VARCHAR(80) UNIQUE);

Cheers,

Derrell
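
With that last schema, duplicate strings can simply be skipped at load time and
the assigned id looked up afterwards; a small sketch with the sqlite3 shell
(fen.db and the literal string are placeholders):

   sqlite3 fen.db 'CREATE TABLE rawfen (id INTEGER PRIMARY KEY, fen VARCHAR(80) UNIQUE);'
   # OR IGNORE turns a duplicate-string insert into a no-op instead of an error.
   sqlite3 fen.db "INSERT OR IGNORE INTO rawfen (fen) VALUES ('some string');"
   # Fetch the unique id that was assigned to a given string.
   sqlite3 fen.db "SELECT id FROM rawfen WHERE fen = 'some string';"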

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re[2]: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread Ion Silvestru

> drh wrote:
>   INSERT INTO two SELECT * FROM one ORDER BY unique_column;

>The ORDER BY is important here.

This is an excerpt from SQLite documentation:

The second form of the INSERT statement takes its data from a SELECT statement.
The number of columns in the result of the SELECT must exactly match the number
of columns in the table if no column list is specified,
or it must match the number of columns named in the column list.
A new entry is made in the table for every row of the SELECT result.
The SELECT may be simple or compound.
If the SELECT statement has an ORDER BY clause, the ORDER BY is ignored.
 ^

Question: ORDER BY is ignored or not ?



-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-22 Thread drh
Gerry Snyder <[EMAIL PROTECTED]> wrote:
> Chris Jones wrote:
> > Hi all,
> > 
> > I have a very simple schema.  I need to assign a unique identifier to a
> > large collection of strings, each at most 80-bytes, although typically
> > shorter.
> > 
> > The problem is I have 112 million of them.
> 
> Maybe you could start by breaking the data into 8 equal groups and make 
> a table of each group. Then merge the original groups pairwise, then 
> merge those 4 groups, and finally the 2 semifinal groups (kinda like 
> March Madness, come to think of it). Since each merging will be of 
> already sorted/indexed data, it might save a lot of time.
> 

This is the right idea.

The problem is that your working set is bigger than your cache
which is causing thrashing.  I suggest a solution like this:

Add entries to table ONE until the table and its unique index get
so big that they no longer fit in cache all at once.  Then
transfer ONE into TWO like this:

   INSERT INTO two SELECT * FROM one ORDER BY unique_column;

The ORDER BY is important here.

Do the above a time or two until TWO is significantly larger than ONE.
Then do the same into TWO_B.  Later combine TWO and TWO_B into THREE:

   INSERT INTO three SELECT * FROM two ORDER BY unique_column;
   INSERT INTO three SELECT * FROM two_b ORDER BY unique_column;

Repeat as necessary for FOUR, FIVE, SIX and so forth.

--
D. Richard Hipp  <[EMAIL PROTECTED]>
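
Spelled out with concrete names, that staging could look something like the
following sketch (fen is the column from the original schema, OR IGNORE is
added here so strings already present in the target are dropped, and when to
spill is up to the loader):

   sqlite3 fen.db 'CREATE TABLE one   (fen TEXT UNIQUE);
                   CREATE TABLE two   (fen TEXT UNIQUE);
                   CREATE TABLE three (fen TEXT UNIQUE);'

   # Fill ONE with ordinary inserts until it outgrows the cache, then spill
   # it into TWO in index order and start ONE over.
   sqlite3 fen.db 'INSERT OR IGNORE INTO two SELECT * FROM one ORDER BY fen;
                   DELETE FROM one;'

   # Later, combine the larger staging tables the same way.
   sqlite3 fen.db 'INSERT OR IGNORE INTO three SELECT * FROM two ORDER BY fen;'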


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-21 Thread Gerry Snyder

Chris Jones wrote:

Hi all,

I have a very simple schema.  I need to assign a unique identifier to a
large collection of strings, each at most 80-bytes, although typically
shorter.

The problem is I have 112 million of them.


Maybe you could start by breaking the data into 8 equal groups and make 
a table of each group. Then merge the original groups pairwise, then 
merge those 4 groups, and finally the 2 semifinal groups (kinda like 
March Madness, come to think of it). Since each merging will be of 
already sorted/indexed data, it might save a lot of time.


Or perhaps do a block sort based on the first character of the string 
(or nth char if most of the first chars are the same) and have a bunch 
of smaller tables with that character as (part of) the table name. The 
global unique identifier could be the character concatenated with the 
rowid in its table.


HTH,

Gerry
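
A sketch of the splitting step for that bucket idea (awk is used only for
illustration, the bucket_* file names are made up, and it assumes the leading
characters are safe to use in file names):

   # Write each string into a bucket file chosen by its first character,
   # e.g. bucket_r, bucket_2, ...; each bucket then gets its own small table.
   awk '{ print > ("bucket_" substr($0, 1, 1)) }' fenout.txt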

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-21 Thread P Kishor

You stated in your OP


I need to assign a unique identifier to a large collection of strings


Obviously I misunderstood that to mean you wanted the strings tagged
uniquely, not that the strings were unique. In your case, it seems
then, you will have to put up with checking each string, and as the db
gets bigger, that task will take longer time.

Maybe someone else has other ideas...

On 3/21/07, Chris Jones <[EMAIL PROTECTED]> wrote:


I don't think that solves my problem.  Sure, it guarantees that the IDs are
unique, but not the strings.

My whole goal is to be able to create a unique identifier for each string,
in such a way that I don't have the same string listed twice, with different
identifiers.

In your solution, there is no way to lookup a string to see if it already
exists, since there is no index on the string.

Thanks,
Chris
--
View this message in context: 
http://www.nabble.com/Index-creation-on-huge-table-will-never-finish.-tf3444218.html#a9607996
Sent from the SQLite mailing list archive at Nabble.com.


-
To unsubscribe, send email to [EMAIL PROTECTED]
-





--
Puneet Kishor http://punkish.eidesis.org/
Nelson Inst. for Env. Studies, UW-Madison http://www.nelson.wisc.edu/
Open Source Geospatial Foundation http://www.osgeo.org/education/
-
collaborate, communicate, compete
=

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-21 Thread Chris Jones

I don't think that solves my problem.  Sure, it guarantees that the IDs are
unique, but not the strings.  

My whole goal is to be able to create a unique identifier for each string,
in such a way that I don't have the same string listed twice, with different
identifiers.

In your solution, there is no way to lookup a string to see if it already
exists, since there is no index on the string.

Thanks,
Chris
-- 
View this message in context: 
http://www.nabble.com/Index-creation-on-huge-table-will-never-finish.-tf3444218.html#a9607996
Sent from the SQLite mailing list archive at Nabble.com.


-
To unsubscribe, send email to [EMAIL PROTECTED]
-



Re: [sqlite] Index creation on huge table will never finish.

2007-03-21 Thread P Kishor

On 3/21/07, Chris Jones <[EMAIL PROTECTED]> wrote:


Hi all,

I have a very simple schema.  I need to assign a unique identifier to a
large collection of strings, each at most 80-bytes, although typically
shorter.

The problem is I have 112 million of them.

My schema looks as follows:

CREATE TABLE rawfen ( fen VARCHAR(80) );
CREATE INDEX rawfen_idx_fen ON rawfen(fen);

Unfortunately, data loading this table takes virtually forever.  After 24
hours, it's not finished, and that is inserting rows in transactions of
100,000 rows per transaction.

I tried dropping the index, and building it after row insertion.

That has two problems.  First, since I have no index, I can't guarantee
string uniqueness (and I'd like to).

Second, it still doesn't solve my speed problem.   Insertion without the
index takes a little over an hour, but the index creation never finishes.
Well, I gave it 6 hours and it was unclear if it was making any progress.


..

It is the uniqueness that is killing it. Doesn't matter how you do it.
As your bucket becomes bigger, it has to do more work to
determine if the string is unique or not. I tried a little test. First
with

CREATE TABLE tbl (fen PRIMARY KEY)

I got results that looked like so, committing every 100,000,...

10:  9 wallclock secs ( 7.36 usr +  1.63 sys =  8.99 CPU)
20: 12 wallclock secs ( 8.06 usr +  2.87 sys = 10.93 CPU)
30: 18 wallclock secs ( 8.27 usr +  3.29 sys = 11.56 CPU)
40: 14 wallclock secs ( 8.32 usr +  3.50 sys = 11.82 CPU)
50: 14 wallclock secs ( 8.45 usr +  3.71 sys = 12.16 CPU)
60: 15 wallclock secs ( 8.49 usr +  3.89 sys = 12.38 CPU)
70: 16 wallclock secs ( 8.68 usr +  4.15 sys = 12.83 CPU)
80: 15 wallclock secs ( 8.61 usr +  4.16 sys = 12.77 CPU)
90: 16 wallclock secs ( 8.69 usr +  4.29 sys = 12.98 CPU)
100: 17 wallclock secs ( 8.65 usr +  4.38 sys = 13.03 CPU)

Then I tried with

CREATE TABLE tbl (fen)

It was a more flat curve like so...

10:  5 wallclock secs ( 5.19 usr +  0.09 sys =  5.28 CPU)
20:  6 wallclock secs ( 5.23 usr +  0.09 sys =  5.32 CPU)
30:  5 wallclock secs ( 5.24 usr +  0.09 sys =  5.33 CPU)
40:  6 wallclock secs ( 5.23 usr +  0.10 sys =  5.33 CPU)
50:  5 wallclock secs ( 5.22 usr +  0.09 sys =  5.31 CPU)
60:  5 wallclock secs ( 5.24 usr +  0.10 sys =  5.34 CPU)
70:  5 wallclock secs ( 5.23 usr +  0.09 sys =  5.32 CPU)
80:  6 wallclock secs ( 5.24 usr +  0.10 sys =  5.34 CPU)
90:  6 wallclock secs ( 5.26 usr +  0.10 sys =  5.36 CPU)
100:  5 wallclock secs ( 5.24 usr +  0.10 sys =  5.34 CPU)

Well, since you want to tag each string with a unique id, how about

CREATE TABLE tbl (id INTEGER PRIMARY KEY, fen)

I get a flat curve again...

10:  6 wallclock secs ( 5.22 usr +  0.10 sys =  5.32 CPU)
20:  5 wallclock secs ( 5.24 usr +  0.09 sys =  5.33 CPU)
30:  6 wallclock secs ( 5.26 usr +  0.10 sys =  5.36 CPU)
40:  5 wallclock secs ( 5.26 usr +  0.10 sys =  5.36 CPU)
50:  5 wallclock secs ( 5.27 usr +  0.11 sys =  5.38 CPU)
60:  6 wallclock secs ( 5.27 usr +  0.10 sys =  5.37 CPU)
70:  6 wallclock secs ( 5.27 usr +  0.09 sys =  5.36 CPU)
80:  6 wallclock secs ( 5.27 usr +  0.10 sys =  5.37 CPU)
90:  6 wallclock secs ( 5.26 usr +  0.10 sys =  5.36 CPU)
100:  5 wallclock secs ( 5.26 usr +  0.10 sys =  5.36 CPU)

Problem solved... id is your unique tag. It will be done in about 1.5
hours on my MacBook Pro.


--
Puneet Kishor http://punkish.eidesis.org/
Nelson Inst. for Env. Studies, UW-Madison http://www.nelson.wisc.edu/
Open Source Geospatial Foundation http://www.osgeo.org/education/
-
collaborate, communicate, compete
=

-
To unsubscribe, send email to [EMAIL PROTECTED]
-



[sqlite] Index creation on huge table will never finish.

2007-03-21 Thread Chris Jones

Hi all,

I have a very simple schema.  I need to assign a unique identifier to a
large collection of strings, each at most 80-bytes, although typically
shorter.

The problem is I have 112 million of them.

My schema looks as follows:

CREATE TABLE rawfen ( fen VARCHAR(80) );
CREATE INDEX rawfen_idx_fen ON rawfen(fen);

Unfortunately, data loading this table takes virtually forever.  After 24
hours, it's not finished, and that is inserting rows in transactions of
100,000 rows per transaction.

I tried dropping the index, and building it after row insertion.

That has two problems.  First, since I have no index, I can't guarantee
string uniqueness (and I'd like to).

Second, it still doesn't solve my speed problem.   Insertion without the
index takes a little over an hour, but the index creation never finishes. 
Well, I gave it 6 hours and it was unclear if it was making any progress.

I've read elsewhere that this is a data locality issue, which certainly
makes sense.

And in those threads, a suggestion has been made to insert in sorted order. 
But it's unclear to me exactly what the sorting function would need to be -
it's likely my sorting function (say strcmp()) wouldn't match what sqlite
expects.  It's also a little unclear if I can even sort this many keys
externally with any ease.  After all, that's why I'm trying to use a
database.

Last, is this something that is likely to affect me if I port over to
another database?  Do others, say mySQL for instance, handle this better?

Does anyone have any suggestions?  Upgrading my server isn't an option -
it already has 4Gig of memory and very fast disks in a RAID-5 configuration
(with a 512MB cache on the raid controller).

Thanks!
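
For reference, the kind of batched load described above (100,000 inserts per
transaction) can be driven entirely from the shell; a rough sketch, assuming
the strings contain no single quotes, with fen.db as a placeholder database
name and fenout.txt as the raw text file (the name used elsewhere in the
thread):

   # Wrap every 100,000 INSERT statements in a single transaction so that
   # each batch costs only one commit (rawfen as in the schema above).
   awk 'BEGIN { print "BEGIN;" }
        { printf "INSERT INTO rawfen (fen) VALUES ('\''%s'\'');\n", $0 }
        NR % 100000 == 0 { print "COMMIT;"; print "BEGIN;" }
        END { print "COMMIT;" }' fenout.txt | sqlite3 fen.db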
-- 
View this message in context: 
http://www.nabble.com/Index-creation-on-huge-table-will-never-finish.-tf3444218.html#a9604705
Sent from the SQLite mailing list archive at Nabble.com.


-
To unsubscribe, send email to [EMAIL PROTECTED]
-