On Mon, 2007-01-08 at 15:52, George Nychis wrote:
> Scott Marlowe wrote:
> > On Mon, 2007-01-08 at 14:58, George Nychis wrote:
> >> Hi,
> >>
> >> I have approximately 2 billion data entries that I would like to
> >> insert into a database. Each entry consists of:
> >>
> >> INT BOOLEAN INT BOOLEAN
> >>
> >> I want to populate a table such that it contains only the unique
> >> rows; all other data should be thrown out. I would say a significant
> >> number of the insertions are going to fail due to unique constraint
> >> violations. The unique constraint is on the two integers, not on the
> >> booleans.
> >>
> >> Using mysql, I was able to do this with the following query, for all
> >> data files (25574 data files total):
> >>
> >> mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE
> >> INTO TABLE ipconn FIELDS TERMINATED BY ' ';"
> >
> > A quick question: could you run selects or other inserts on that
> > table while the LOAD DATA INFILE was running? Because I'm guessing it
> > basically locked the whole table while running.
>
> What does this have to do with my question? I don't need to run
> selects or inserts on the table while the load data is running...
>
> >> What I *think* mysql did was sort each data file and do a sort of
> >> merge sort between the data I was inserting and the data in the
> >> database. It would insert the first unique instance of a row it saw,
> >> and reject all other insertions that violated the unique constraint,
> >> due to the "IGNORE".
> >
> > I think so too. Which would require "one big lock" on the table,
> > which would mean no parallel access.
>
> That's fine; it doesn't matter.
>
> > It's also likely that it used a temp table, which doubled the size of
> > the database while you were inserting.
> >
> >> From what I understand, this functionality is not in postgresql.
> >> Fine, I certainly can't change that. But I am looking for a
> >> comparable solution for the size of my data.
> >>
> >> One solution is to have a temporary table, insert all 2 billion
> >> rows, and then copy the distinct entries to another table. This
> >> would be like one massive sort?
> >>
> >> Is this the only/best solution using postgresql?
> >
> > TANSTAAFL. PostgreSQL is designed so that you can run an import
> > process on a table while 100 other users still access it at the same
> > time. Because of that, you don't get to do the dirty, nasty things
> > under the sheets that made data loading and merging so easy in MySQL.
> > Apples and oranges.
> >
> > Assuming you're loading into an empty table, loading into a temp
> > table and then selecting the distinct rows out into the final table
> > seems reasonable, and should run reasonably fast. If you need to load
> > into an existing table, it might get a little more complex.
>
> The goal is not to run queries while the data is being inserted... I
> am wondering whether the postgresql method I mentioned, inserting
> everything and then keeping only the distinct values, is the most
> optimal approach; it would produce the same results as the method I
> described for mysql.
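For what it's worth, here's a minimal sketch of that load-to-temp,
select-distinct approach. The table and column names (staging, src,
src_flag, dst, dst_flag) are guesses at your schema, and I'm assuming
the unique pair is the two integer columns; adjust to taste.

BEGIN;

-- staging carries no constraints, so the bulk load itself never fails
CREATE TEMP TABLE staging (
    src       integer,
    src_flag  boolean,
    dst       integer,
    dst_flag  boolean
);

-- COPY is the rough equivalent of LOAD DATA INFILE; repeat this per
-- file, or cat the 25574 files together first.  Note that COPY FROM
-- 'file' reads a file on the server and needs superuser; from a client
-- you'd use psql's \copy instead.
COPY staging FROM '/path/to/file' WITH DELIMITER ' ';

-- keep one arbitrary survivor per (src, dst) pair, which is roughly
-- what IGNORE gave you in mysql
INSERT INTO ipconn (src, src_flag, dst, dst_flag)
SELECT DISTINCT ON (src, dst) src, src_flag, dst, dst_flag
FROM staging;

COMMIT;

The SELECT DISTINCT ON step is where your "one massive sort" happens,
so give the backend plenty of work_mem for that session if you can.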
Did I fail to answer your question? Sorry if I gave you more
information than you needed. Please feel free to ask someone else next
time.