Gerry Snyder <[EMAIL PROTECTED]> wrote:
> Chris Jones wrote:
> > Hi all,
> > 
> > I have a very simple schema.  I need to assign a unique identifier to a
> > large collection of strings, each at most 80 bytes, although typically
> > shorter.
> > 
> > The problem is I have 112 million of them.
> 
> Maybe you could start by breaking the data into 8 equal groups and
> making a table for each group. Then merge the original groups pairwise,
> then merge those 4 groups, and finally the 2 semifinal groups (kinda
> like March Madness, come to think of it). Since each merge works on
> already sorted/indexed data, it might save a lot of time.
> 

This is the right idea.

The problem is that your working set is bigger than your cache,
which causes thrashing.  I suggest a solution like this:

Add entries to table ONE until the table and its unique index get
so big that they no longer fit in cache all at once.  Then
transfer ONE into TWO like this:

   INSERT INTO two SELECT * FROM one ORDER BY unique_column;

The ORDER BY is important here: because the rows arrive already sorted,
the index on TWO grows by appending to its rightmost pages instead of
touching pages scattered across an index too big to stay in cache.
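
For concreteness, a minimal sketch of one fill-and-transfer round.  The
schema (a single txt column with a unique index), the DELETE that empties
ONE between rounds, and the assumption that no string repeats across
batches (otherwise TWO's unique index would reject the transfer) are all
mine; the original post does not show them:

   CREATE TABLE one(txt TEXT UNIQUE);   -- small staging table
   CREATE TABLE two(txt TEXT UNIQUE);   -- accumulates sorted batches

   -- ... bulk-insert one batch of strings into ONE here ...

   -- Transfer in sorted order so TWO's index grows by appending
   -- rather than by random page writes.
   INSERT INTO two SELECT * FROM one ORDER BY txt;

   -- Assumed step: empty ONE so the next batch starts small again.
   DELETE FROM one;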

Do the above a time or two until TWO is significantly larger than ONE.
Then do the same into TWO_B.  Later combine TWO and TWO_B into THREE:

   INSERT INTO three SELECT * FROM two ORDER BY unique_column;
   INSERT INTO three SELECT * FROM two_b ORDER BY unique_column;

Repeat as necessary for FOUR, FIVE, SIX and so forth.
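
On the original question of assigning each string a unique identifier,
one way the last merge could hand out the numbers, sketched here under
my own assumptions (placeholder table names, and no string present in
both input halves), is to let SQLite assign them through an INTEGER
PRIMARY KEY column, which is the rowid:

   CREATE TABLE final(id INTEGER PRIMARY KEY, txt TEXT UNIQUE);

   -- id is the rowid and is filled in automatically on insert.
   INSERT INTO final(txt) SELECT txt FROM six ORDER BY txt;
   INSERT INTO final(txt) SELECT txt FROM six_b ORDER BY txt;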

--
D. Richard Hipp  <[EMAIL PROTECTED]>

