Scott Marlowe wrote:
> On Mon, 2007-01-08 at 15:52, George Nychis wrote:
>> Scott Marlowe wrote:
>>> On Mon, 2007-01-08 at 14:58, George Nychis wrote:
>>>> Hi,
>>>>
>>>> I have approximately 2 billion data entries that I would like to insert 
>>>> into a database.
>>>> Each entry consists of:
>>>> INT BOOLEAN INT BOOLEAN
>>>>
>>>> I want to populate a table such that it only contains the unique
>>>> rows; all other data should be thrown out.  I would say a significant
>>>> number of the insertions are going to fail due to unique constraints.
>>>> The unique constraint is on the two integers, not on the booleans.
>>>>
>>>> Using mysql, I was able to do this with the following query, for all
>>>> data files (25574 data files total):
>>>> mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE
>>>> INTO TABLE ipconn FIELDS TERMINATED BY ' ';"
>>> A quick question.  Could you run selects or other inserts on that table
>>> while the load data infile was running?  Cause I'm guessing that it
>>> basically locked the whole table while running.
>> What does this have to do with my question?  I don't need to run
>> selects or inserts on the table while the load data is running...
>>
>>>> What I *think* mysql did was sort each data file and do a sort of
>>>> merge sort between the data I was inserting and the data in the
>>>> database.  It would insert the first unique instance of a row it
>>>> saw, and reject all other insertions that violated the unique
>>>> constraint due to the "IGNORE".
>>> Me too.  Which would require "one big lock" on the table which would
>>> mean no parallel access.
>> That's fine, it doesn't matter.
>>
>>> It's also likely that it used a temp table which doubled the size of the
>>> database while you were inserting.
>>>
>>>> From what I understand, this functionality is not in postgresql.
>>>> Fine, I certainly can't change that.  But I am looking for a
>>>> comparable solution for the size of my data.
>>>>
>>>> One solution is to have a temporary table, insert all 2 billion
>>>> rows, and then copy the distinct entries to another table.  This
>>>> would be like one massive sort?
>>>>
>>>> Is this the only/best solution using postgresql?
>>> TANSTAAFL.  PostgreSQL is designed so that you can run an import process
>>> on that table while 100 other users still access it at the same time. 
>>> Because of that, you don't get to do dirty, nasty things under the
>>> sheets that allow for super easy data loading and merging like you got
>>> with MySQL.  Apples and Oranges.
>>>
>>> Assuming you're loading into an empty table, the load to temp, select
>>> distinct out and into the final table seems reasonable, should run
>>> reasonably fast.  If you need to load to an existing table, it might get
>>> a little more complex.
>>>
>> The goal is not to run queries while the data is being inserted.  I am
>> wondering whether the postgresql method I mentioned, inserting
>> everything and keeping only the distinct values, is optimal, since it
>> would produce the same results as the method I explained for mysql.
> 
> Did I fail to answer your question?
> 
> Sorry if I gave you more information than you needed.  Please feel free
> to ask someone else next time.
> 

Ahhh, I missed your last paragraph... so much text.  Actually, yeah, that
answers my question, thank you.  I guess it's more a single run through
than the mysql method, which was piece-wise.
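
For the archives, here is a rough sketch of that single run through as I
understand it.  The staging table and column names are placeholders (the
real columns are just the two integers and two booleans, with the unique
constraint on the integer pair), and the COPY options will depend on how
the data files are formatted:

    -- Staging table with no constraints, so the bulk load never fails
    -- on duplicates.
    CREATE TABLE ipconn_staging (
        src_ip   integer,
        src_flag boolean,
        dst_ip   integer,
        dst_flag boolean
    );

    -- Load each of the 25574 data files.  Server-side COPY needs the
    -- file to be readable by the backend; psql's \copy is the
    -- client-side alternative.
    COPY ipconn_staging FROM '/path/to/file' DELIMITER ' ';

    -- Final table, with the unique constraint on the two integers.
    CREATE TABLE ipconn (
        src_ip   integer,
        src_flag boolean,
        dst_ip   integer,
        dst_flag boolean,
        UNIQUE (src_ip, dst_ip)
    );

    -- Keep one arbitrary row per (src_ip, dst_ip) pair, which mimics
    -- the effect of MySQL's IGNORE.
    INSERT INTO ipconn
    SELECT DISTINCT ON (src_ip, dst_ip) src_ip, src_flag, dst_ip, dst_flag
    FROM ipconn_staging;

    DROP TABLE ipconn_staging;

The DISTINCT ON over 2 billion rows will be one big sort, so bumping
work_mem for that session, and perhaps adding the unique constraint only
after the INSERT, should speed things up.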

Thanks for the help/response.

- George
