On 3/21/07, Chris Jones <[EMAIL PROTECTED]> wrote:
Hi all, I have a very simple schema. I need to assign a unique identifier to a large collection of strings, each at most 80-bytes, although typically shorter. The problem is I have 112 million of them. My schema looks as follows: CREATE TABLE rawfen ( fen VARCHAR(80) ); CREATE INDEX rawfen_idx_fen ON rawfen(fen); Unforuntately, data loading this table takes virtually forever. After 24 hours, its not finished, and that is inserting rows in transactions of 100,000 rows per transaction. I tried dropping the index, and building it after row insertion. That has two problems. First, since I have no index, I can't guarantee string uniqueness (and I'd like to). Second, it still doesn't solve my speed problem. Insertion without the index takes a little over an hour, but the index creation never finishes. Well, I gave it 6 hours and it was unclear if it was making any progress.
.. It is the uniqueness that is killing it. Doesn't matter how you do it. As more your bucket becomes bigger, it has to do more work to determine if the string is unique or not. I tried a little test. First with CREATE TABLE tbl (fen PRIMARY KEY) I got results that looked like so, committing every 100,000,... 100000: 9 wallclock secs ( 7.36 usr + 1.63 sys = 8.99 CPU) 200000: 12 wallclock secs ( 8.06 usr + 2.87 sys = 10.93 CPU) 300000: 18 wallclock secs ( 8.27 usr + 3.29 sys = 11.56 CPU) 400000: 14 wallclock secs ( 8.32 usr + 3.50 sys = 11.82 CPU) 500000: 14 wallclock secs ( 8.45 usr + 3.71 sys = 12.16 CPU) 600000: 15 wallclock secs ( 8.49 usr + 3.89 sys = 12.38 CPU) 700000: 16 wallclock secs ( 8.68 usr + 4.15 sys = 12.83 CPU) 800000: 15 wallclock secs ( 8.61 usr + 4.16 sys = 12.77 CPU) 900000: 16 wallclock secs ( 8.69 usr + 4.29 sys = 12.98 CPU) 1000000: 17 wallclock secs ( 8.65 usr + 4.38 sys = 13.03 CPU) Then I tried with CREATE TABLE tbl (fen) It was a more flat curve like so... 100000: 5 wallclock secs ( 5.19 usr + 0.09 sys = 5.28 CPU) 200000: 6 wallclock secs ( 5.23 usr + 0.09 sys = 5.32 CPU) 300000: 5 wallclock secs ( 5.24 usr + 0.09 sys = 5.33 CPU) 400000: 6 wallclock secs ( 5.23 usr + 0.10 sys = 5.33 CPU) 500000: 5 wallclock secs ( 5.22 usr + 0.09 sys = 5.31 CPU) 600000: 5 wallclock secs ( 5.24 usr + 0.10 sys = 5.34 CPU) 700000: 5 wallclock secs ( 5.23 usr + 0.09 sys = 5.32 CPU) 800000: 6 wallclock secs ( 5.24 usr + 0.10 sys = 5.34 CPU) 900000: 6 wallclock secs ( 5.26 usr + 0.10 sys = 5.36 CPU) 1000000: 5 wallclock secs ( 5.24 usr + 0.10 sys = 5.34 CPU) Well, since you want to tag each string with a unique id, how about CREATE TABLE tbl (id INTEGER PRIMARY KEY, fen) I get a flat curve again... 100000: 6 wallclock secs ( 5.22 usr + 0.10 sys = 5.32 CPU) 200000: 5 wallclock secs ( 5.24 usr + 0.09 sys = 5.33 CPU) 300000: 6 wallclock secs ( 5.26 usr + 0.10 sys = 5.36 CPU) 400000: 5 wallclock secs ( 5.26 usr + 0.10 sys = 5.36 CPU) 500000: 5 wallclock secs ( 5.27 usr + 0.11 sys = 5.38 CPU) 600000: 6 wallclock secs ( 5.27 usr + 0.10 sys = 5.37 CPU) 700000: 6 wallclock secs ( 5.27 usr + 0.09 sys = 5.36 CPU) 800000: 6 wallclock secs ( 5.27 usr + 0.10 sys = 5.37 CPU) 900000: 6 wallclock secs ( 5.26 usr + 0.10 sys = 5.36 CPU) 1000000: 5 wallclock secs ( 5.26 usr + 0.10 sys = 5.36 CPU) Problem solved... id is your unique tag. It will be done in about 1.5 hours on my MacBook Pro. -- Puneet Kishor http://punkish.eidesis.org/ Nelson Inst. for Env. Studies, UW-Madison http://www.nelson.wisc.edu/ Open Source Geospatial Foundation http://www.osgeo.org/education/ --------------------------------------------------------------------- collaborate, communicate, compete ===================================================================== ----------------------------------------------------------------------------- To unsubscribe, send email to [EMAIL PROTECTED] -----------------------------------------------------------------------------