shiplu wrote:
On Sun, Jan 24, 2010 at 3:11 AM, D. Dante Lorenso <da...@lorenso.com> wrote:
All,

I'm loading millions of records into a backend PHP cli script that I
need to build a hash index from to optimize key lookups for data that
I'm importing into a MySQL database.  The problem is that storing this
data in a PHP array is not very memory efficient and my millions of
records are consuming about 4-6 GB of ram.


What are you storing? An array of row objects??
In that case storing only the row id is will reduce the memory.

I am querying a MySQL database which contains 40 million records and mapping string columns to numeric ids. You might consider it normalizing the data.

Then, I am importing a new 40 million records and comparing the new values to the old values. Where the value matches, I update records, but where they do not match, I insert new records, and finally I go back and delete old records. So, the net result is that I have a database with 40 million records that I need to "sync" on a daily basis.

If you are loading full row objects, it will take a lot of memory.
But if you just load the row id values, it will significantly decrease
the memory amount.

For what I am trying to do, I just need to map a string value (32 bytes) to a bigint value (8 bytes) in a fast-lookup hash.

Besides, You can load row ids in a chunk by chunk basis. if you have
10 millions of rows to process. load 10000 rows as a chunk. process
them then load the next chunk.  This will significantly reduce memory
usage.

When importing the fresh 40 million records, I need to compare each record with 4 different indexes that will map the record to existing other records, or into a "group_id" that the record also belongs to. My current solution uses a trigger in MySQL that will do the lookups inside MySQL, but this is extremely slow. Pre-loading the mysql indexes into PHP ram and processing that was is thousands of times faster.

I just need an efficient way to hold my hash tables in PHP ram. PHP arrays are very fast, but like my original post says, they consume way too much ram.

A good algorithm can solve your problem anytime. ;-)

It takes about 5-10 minutes to build my hash indexes in PHP ram currently which makes up for the 10,000 x speedup on key lookups that I get later on. I just want to not use the whole 6 GB of ram to do this. I need an efficient hashing API that supports something like:

        $value = (int) fasthash_get((string) $key);
        $exists = (bool) fasthash_exists((string) $key);
        fasthash_set((string) $key, (int) $value);

Or ... it feels like a "memcached" api but where the data is stored locally instead of accessed via a network. So this is how my search led me to what appears to be a dead "lchash" extension.

-- Dante

----------
D. Dante Lorenso
da...@lorenso.com
972-333-4139

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Reply via email to