On 3/21/07, Chris Jones <[EMAIL PROTECTED]> wrote:

Hi all,

I have a very simple schema.  I need to assign a unique identifier to a
large collection of strings, each at most 80-bytes, although typically
shorter.

The problem is I have 112 million of them.

My schema looks as follows:

CREATE TABLE rawfen ( fen VARCHAR(80) );
CREATE INDEX rawfen_idx_fen ON rawfen(fen);

Unforuntately, data loading this table takes virtually forever.  After 24
hours, its not finished, and that is inserting rows in transactions of
100,000 rows per transaction.

I tried dropping the index, and building it after row insertion.

That has two problems.  First, since I have no index, I can't guarantee
string uniqueness (and I'd like to).

Second, it still doesn't solve my speed problem.   Insertion without the
index takes a little over an hour, but the index creation never finishes.
Well, I gave it 6 hours and it was unclear if it was making any progress.

..

It is the uniqueness that is killing it. Doesn't matter how you do it.
As more your bucket becomes bigger, it has to do more work to
determine if the string is unique or not. I tried a little test. First
with

CREATE TABLE tbl (fen PRIMARY KEY)

I got results that looked like so, committing every 100,000,...

100000:  9 wallclock secs ( 7.36 usr +  1.63 sys =  8.99 CPU)
200000: 12 wallclock secs ( 8.06 usr +  2.87 sys = 10.93 CPU)
300000: 18 wallclock secs ( 8.27 usr +  3.29 sys = 11.56 CPU)
400000: 14 wallclock secs ( 8.32 usr +  3.50 sys = 11.82 CPU)
500000: 14 wallclock secs ( 8.45 usr +  3.71 sys = 12.16 CPU)
600000: 15 wallclock secs ( 8.49 usr +  3.89 sys = 12.38 CPU)
700000: 16 wallclock secs ( 8.68 usr +  4.15 sys = 12.83 CPU)
800000: 15 wallclock secs ( 8.61 usr +  4.16 sys = 12.77 CPU)
900000: 16 wallclock secs ( 8.69 usr +  4.29 sys = 12.98 CPU)
1000000: 17 wallclock secs ( 8.65 usr +  4.38 sys = 13.03 CPU)

Then I tried with

CREATE TABLE tbl (fen)

It was a more flat curve like so...

100000:  5 wallclock secs ( 5.19 usr +  0.09 sys =  5.28 CPU)
200000:  6 wallclock secs ( 5.23 usr +  0.09 sys =  5.32 CPU)
300000:  5 wallclock secs ( 5.24 usr +  0.09 sys =  5.33 CPU)
400000:  6 wallclock secs ( 5.23 usr +  0.10 sys =  5.33 CPU)
500000:  5 wallclock secs ( 5.22 usr +  0.09 sys =  5.31 CPU)
600000:  5 wallclock secs ( 5.24 usr +  0.10 sys =  5.34 CPU)
700000:  5 wallclock secs ( 5.23 usr +  0.09 sys =  5.32 CPU)
800000:  6 wallclock secs ( 5.24 usr +  0.10 sys =  5.34 CPU)
900000:  6 wallclock secs ( 5.26 usr +  0.10 sys =  5.36 CPU)
1000000:  5 wallclock secs ( 5.24 usr +  0.10 sys =  5.34 CPU)

Well, since you want to tag each string with a unique id, how about

CREATE TABLE tbl (id INTEGER PRIMARY KEY, fen)

I get a flat curve again...

100000:  6 wallclock secs ( 5.22 usr +  0.10 sys =  5.32 CPU)
200000:  5 wallclock secs ( 5.24 usr +  0.09 sys =  5.33 CPU)
300000:  6 wallclock secs ( 5.26 usr +  0.10 sys =  5.36 CPU)
400000:  5 wallclock secs ( 5.26 usr +  0.10 sys =  5.36 CPU)
500000:  5 wallclock secs ( 5.27 usr +  0.11 sys =  5.38 CPU)
600000:  6 wallclock secs ( 5.27 usr +  0.10 sys =  5.37 CPU)
700000:  6 wallclock secs ( 5.27 usr +  0.09 sys =  5.36 CPU)
800000:  6 wallclock secs ( 5.27 usr +  0.10 sys =  5.37 CPU)
900000:  6 wallclock secs ( 5.26 usr +  0.10 sys =  5.36 CPU)
1000000:  5 wallclock secs ( 5.26 usr +  0.10 sys =  5.36 CPU)

Problem solved... id is your unique tag. It will be done in about 1.5
hours on my MacBook Pro.


--
Puneet Kishor http://punkish.eidesis.org/
Nelson Inst. for Env. Studies, UW-Madison http://www.nelson.wisc.edu/
Open Source Geospatial Foundation http://www.osgeo.org/education/
---------------------------------------------------------------------
collaborate, communicate, compete
=====================================================================

-----------------------------------------------------------------------------
To unsubscribe, send email to [EMAIL PROTECTED]
-----------------------------------------------------------------------------

Reply via email to