Re: [PHP] memory efficient hash table extension? like lchash ...

2010-01-25 Thread J Ravi Menon
PHP does expose System V shared-memory APIs (the shm_* functions):

http://us2.php.net/manual/en/book.sem.php
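
Something like this untested sketch -- note that Sys V shm keys
variables by *integer*, so a string key has to be hashed down to an
int first (crc32 here, which can collide, so this is illustration
only):

    <?php
    // attach (or create) a 64 MB segment keyed off this file
    $shm = shm_attach(ftok(__FILE__, 'h'), 64 * 1024 * 1024);

    // store / check / fetch an int value under a hashed string key
    shm_put_var($shm, crc32('some-32-byte-key'), 12345);
    if (shm_has_var($shm, crc32('some-32-byte-key'))) {
        $value = shm_get_var($shm, crc32('some-32-byte-key'));
    }
    shm_detach($shm);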

If you already have apc installed, you could also try:

http://us2.php.net/manual/en/book.apc.php

APC also allows you to store user-specific data (it is kept in shared
memory).
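
A minimal untested sketch of the user-cache calls (ttl 0 means keep
until restart or eviction; the by-reference $success flag tells a
stored false apart from a miss; the 'idx:' prefix is just an example
namespace):

    <?php
    apc_store('idx:' . $key, (int) $id, 0);

    $id = apc_fetch('idx:' . $key, $success);
    if ($success) {
        // hit: $id holds the stored value
    }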

I haven't tried these myself, so I would do some quick tests to check
whether they meet your performance requirements. In theory, they should
be faster than Berkeley DB-style solutions (another option as well,
though it seems something similar, like MongoDB, was not good enough?).

I am curious to know if someone here has run these tests. Note that
with memcached installed locally (on the same box running php), it can
be surprisingly efficient - using pconnect(), caching the handle in a
static var for a given request cycle, etc.
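
Roughly this kind of thing (untested; assumes the pecl Memcache
extension and a memcached on the default port):

    <?php
    function mc() {
        static $mc = null;
        if ($mc === null) {
            $mc = new Memcache();
            $mc->pconnect('127.0.0.1', 11211);  // persistent socket
        }
        return $mc;
    }

    $id = mc()->get($key);  // false on miss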

Ravi



Re: [PHP] memory efficient hash table extension? like lchash ...

2010-01-25 Thread D. Dante Lorenso

J Ravi Menon wrote:

PHP does expose System V shared-memory APIs (the shm_* functions):
http://us2.php.net/manual/en/book.sem.php



I will look into this.  I really need a key/value map, though, and 
would rather not have to write my own on top of SHM.




If you already have apc installed, you could also try:
http://us2.php.net/manual/en/book.apc.php
APC also allows you to store user-specific data (it is kept in shared
memory).



I've looked into the apc_store and apc_fetch routines:
http://php.net/manual/en/function.apc-store.php
http://www.php.net/manual/en/function.apc-fetch.php
... but quickly ran out of memory for APC, and though I figured out how 
to configure it to use more (adjusting the shared memory allotment), 
there were other problems.  I ran into issues with logs complaining 
about cache slamming and other known bugs in APC version 3.1.3p1.  
Also, after several million values were stored, APC storage began to 
slow down *dramatically*.  I wasn't certain whether APC was using only 
RAM or was possibly also writing to disk.  Performance tanked so 
quickly that I set it aside as an option and moved on.
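
For reference, the shared-memory adjustment was along these lines (a
php.ini sketch; the setting names are APC's, but units and defaults
changed between APC versions, so treat the values as placeholders):

    apc.shm_size = 1024M   ; total shared memory for APC's cache
                           ; (older APC wants a plain MB integer)
    apc.slam_defense = 0   ; one knob tied to the "cache slamming" logs
    apc.write_lock = 1     ; the newer locking strategy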




I haven't tried these myself, so I would do some quick tests to check
whether they meet your performance requirements. In theory, they should
be faster than Berkeley DB-style solutions (another option as well,
though it seems something similar, like MongoDB, was not good enough?).



I will run more tests against MongoDB.  Initially I tried to use it to 
store everything.  If I only store my indexes, it might fare better. 
Certainly, though, running queries and updates against a remote server 
will always be slower than doing the lookups locally in ram.




I am curious to know if someone here has run these tests. Note that
with memcached installed locally (on the same box running php), it can
be surprisingly efficient - using pconnect(), caching the handle in a
static var for a given request cycle, etc.


memcached gives no guarantee about data persistence.  I need to have a 
hash table that will contain all the values I set.  They don't need to 
survive a server shutdown (don't need to be written to disk), but I can 
not afford for the server to throw away values that don't fit into 
memory.  If there is a way to configure memcached to guarantee 
storage, that might work.


-- Dante



On Sun, Jan 24, 2010 at 9:39 AM, D. Dante Lorenso da...@lorenso.com wrote:

shiplu wrote:

On Sun, Jan 24, 2010 at 3:11 AM, D. Dante Lorenso da...@lorenso.com
wrote:

All,

I'm loading millions of records into a backend PHP cli script that I
need to build a hash index from to optimize key lookups for data that
I'm importing into a MySQL database.  The problem is that storing this
data in a PHP array is not very memory efficient and my millions of
records are consuming about 4-6 GB of ram.


What are you storing? An array of row objects??
In that case storing only the row id is will reduce the memory.

I am querying a MySQL database which contains 40 million records and mapping
string columns to numeric ids.  You might consider it normalizing the data.

Then, I am importing a new 40 million records and comparing the new values
to the old values.  Where the value matches, I update records, but where
they do not match, I insert new records, and finally I go back and delete
old records.  So, the net result is that I have a database with 40 million
records that I need to sync on a daily basis.


If you are loading full row objects, it will take a lot of memory.
But if you just load the row id values, it will significantly decrease
the memory amount.

For what I am trying to do, I just need to map a string value (32 bytes) to
a bigint value (8 bytes) in a fast-lookup hash.


Besides, You can load row ids in a chunk by chunk basis. if you have
10 millions of rows to process. load 1 rows as a chunk. process
them then load the next chunk.  This will significantly reduce memory
usage.

When importing the fresh 40 million records, I need to compare each record
with 4 different indexes that will map the record to existing other records,
or into a group_id that the record also belongs to.  My current solution
uses a trigger in MySQL that will do the lookups inside MySQL, but this is
extremely slow.  Pre-loading the mysql indexes into PHP ram and processing
that was is thousands of times faster.

I just need an efficient way to hold my hash tables in PHP ram.  PHP arrays
are very fast, but like my original post says, they consume way too much
ram.


A good algorithm can solve your problem anytime. ;-)

It takes about 5-10 minutes to build my hash indexes in PHP ram currently
which makes up for the 10,000 x speedup on key lookups that I get later on.
 I just want to not use the whole 6 GB of ram to do this.   I need an
efficient hashing API that supports something like:

   $value = (int) fasthash_get((string) $key);
   $exists = (bool) fasthash_exists((string) $key);
   fasthash_set((string) $key, (int) $value);

Or 

Re: [PHP] memory efficient hash table extension? like lchash ...

2010-01-25 Thread J Ravi Menon
 Also, after several million values were stored, APC storage began to
 slow down *dramatically*.  I wasn't certain whether APC was using only
 RAM or was possibly also writing to disk.  Performance tanked so
 quickly that I set it aside as an option and moved on.
IIRC, it is built over shm and there is no disk backing store.


 memcached gives no guarantee about data persistence.  I need to have a hash
 table that will contain all the values I set.  They don't need to survive a
 server shutdown (don't need to be written to disk), but I can not afford for
 the server to throw away values that don't fit into memory.  If there is a
 way to configure memcached to guarantee storage, that might work.

True, but the LRU policy only kicks in lazily. So if you ensure that
you never get near the max memory limit (the -m option), and you store
your key-value pairs with no expiry, they will be present until the
next restart. Essentially you would have to estimate a -m value big
enough to accommodate all possible key-value pairs (the evictions
counter in memcached stats should remain 0). BTW, I have seen this
implementation behavior in the 1.2.x series, but I am not sure it is
necessarily guaranteed in future versions.
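
An untested sketch of that setup -- a fixed -m ceiling, entries stored
with no expiry, and the evictions counter checked (Memcache extension
API; the 2 GB figure is a placeholder):

    # start memcached with more memory than the data set needs
    memcached -d -p 11211 -m 2048

    <?php
    // store with flags=0 and expire=0 (never expires)
    $mc = new Memcache();
    $mc->pconnect('127.0.0.1', 11211);
    $mc->set($key, (int) $id, 0, 0);

    // verify nothing is being thrown away
    $stats = $mc->getStats();
    if ($stats['evictions'] > 0) {
        // undersized: raise -m
    }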

Ravi




Re: [PHP] memory efficient hash table extension? like lchash ...

2010-01-24 Thread D. Dante Lorenso

shiplu wrote:

On Sun, Jan 24, 2010 at 3:11 AM, D. Dante Lorenso da...@lorenso.com wrote:

All,

I'm loading millions of records into a backend PHP cli script that I
need to build a hash index from to optimize key lookups for data that
I'm importing into a MySQL database.  The problem is that storing this
data in a PHP array is not very memory efficient and my millions of
records are consuming about 4-6 GB of ram.



What are you storing? An array of row objects??
In that case storing only the row id will reduce the memory.


I am querying a MySQL database which contains 40 million records and 
mapping string columns to numeric ids.  You might think of it as 
normalizing the data.


Then, I am importing a new 40 million records and comparing the new 
values to the old values.  Where the value matches, I update records, 
but where they do not match, I insert new records, and finally I go back 
and delete old records.  So, the net result is that I have a database 
with 40 million records that I need to sync on a daily basis.



If you are loading full row objects, it will take a lot of memory.
But if you just load the row id values, it will significantly decrease
the memory usage.


For what I am trying to do, I just need to map a string value (32 bytes) 
to a bigint value (8 bytes) in a fast-lookup hash.
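
(Back-of-the-envelope: 40 million entries x (32-byte key + 8-byte
value) is only about 1.6 GB of raw data, so the observed 4-6 GB implies
on the order of 100-150 bytes of per-entry zval and hash-bucket
overhead in a PHP array.)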



Besides, you can load row ids on a chunk-by-chunk basis. If you have
10 million rows to process, load one chunk of rows, process
them, then load the next chunk.  This will significantly reduce memory
usage.


When importing the fresh 40 million records, I need to compare each 
record with 4 different indexes that will map the record to existing 
other records, or into a group_id that the record also belongs to.  My 
current solution uses a trigger in MySQL that will do the lookups inside 
MySQL, but this is extremely slow.  Pre-loading the mysql indexes into 
PHP ram and processing that way is thousands of times faster.
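
An untested sketch of that preload step (table and column names
invented for illustration):

    <?php
    $pdo = new PDO('mysql:host=127.0.0.1;dbname=mydb', 'user', 'pass');

    // pull the whole (string key -> id) index into a plain array once
    $map  = array();
    $stmt = $pdo->query('SELECT str_key, id FROM lookup_index');
    while ($row = $stmt->fetch(PDO::FETCH_NUM)) {
        $map[$row[0]] = (int) $row[1];
    }

    // later, per imported record: an O(1) in-memory lookup
    $id = isset($map[$key]) ? $map[$key] : null;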


I just need an efficient way to hold my hash tables in PHP ram.  PHP 
arrays are very fast, but like my original post says, they consume way 
too much ram.



A good algorithm can solve your problem anytime. ;-)


It currently takes about 5-10 minutes to build my hash indexes in PHP 
ram, which is made up for by the 10,000x speedup on key lookups that I 
get later on.  I just want to not use the whole 6 GB of ram to do this. 
I need an efficient hashing API that supports something like:


$value = (int) fasthash_get((string) $key);
$exists = (bool) fasthash_exists((string) $key);
fasthash_set((string) $key, (int) $value);

Or ... it feels like a memcached api but where the data is stored 
locally instead of accessed via a network.  So this is how my search led 
me to what appears to be a dead lchash extension.
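
(For what it's worth, that wished-for API could be mocked up over APC's
user cache -- an untested sketch only, using the hypothetical fasthash
names from above, and the APC slowdown described elsewhere in this
thread suggests it would not hold up at tens of millions of keys:)

    <?php
    function fasthash_set($key, $value) {
        return apc_store('fh:' . $key, (int) $value, 0);
    }
    function fasthash_get($key) {
        return (int) apc_fetch('fh:' . $key);
    }
    function fasthash_exists($key) {
        apc_fetch('fh:' . $key, $ok);
        return (bool) $ok;
    }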


-- Dante

--
D. Dante Lorenso
da...@lorenso.com
972-333-4139




[PHP] memory efficient hash table extension? like lchash ...

2010-01-23 Thread D. Dante Lorenso

All,

I'm loading millions of records into a backend PHP cli script that I
need to build a hash index from to optimize key lookups for data that
I'm importing into a MySQL database.  The problem is that storing this
data in a PHP array is not very memory efficient and my millions of
records are consuming about 4-6 GB of ram.

I have tried using some external key/value storage solutions like
MemcacheDB, MongoDB, and straight MySQL, but none of these are fast
enough for what I'm trying to do.

Then I found the lchash extension for PHP and it looks like exactly
what I want.  It's a c-lib hash which is accessed from PHP.  Using it
would be slightly slower than using straight PHP arrays, but would be
much more memory efficient since not all data needs to be stored as PHP
zvals, etc.
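
The zval overhead is easy to see directly (untested sketch; builds a
million md5-key => int entries and prints the per-entry cost):

    <?php
    $n = 1000000;
    $before = memory_get_usage(true);

    $map = array();
    for ($i = 0; $i < $n; $i++) {
        $map[md5($i)] = $i;   // 32-byte string key => int value
    }

    printf("%.1f bytes per entry\n",
           (memory_get_usage(true) - $before) / $n);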

Problem is that the lchash extension can't be installed in my PHP 5.3 
build because pecl install lchash fails with a message about an invalid 
checksum on the README file.  Apparently this extension has been 
neglected and abandoned and hasn't been updated since 2005.


Is there something like lchash that *is* being maintained?  What would 
you all suggest?


-- Dante





Re: [PHP] memory efficient hash table extension? like lchash ...

2010-01-23 Thread shiplu
On Sun, Jan 24, 2010 at 3:11 AM, D. Dante Lorenso da...@lorenso.com wrote:
 All,

 I'm loading millions of records into a backend PHP cli script that I
 need to build a hash index from to optimize key lookups for data that
 I'm importing into a MySQL database.  The problem is that storing this
 data in a PHP array is not very memory efficient and my millions of
 records are consuming about 4-6 GB of ram.


What are you storing? An array of row objects??
In that case storing only the row id will reduce the memory.

If you are loading full row objects, it will take a lot of memory.
But if you just load the row id values, it will significantly decrease
the memory usage.

Besides, you can load row ids on a chunk-by-chunk basis. If you have
10 million rows to process, load one chunk of rows, process
them, then load the next chunk.  This will significantly reduce memory
usage.
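
A rough sketch of the idea (untested; table and column names invented,
and the interpolated values are trusted integers, not user input):

    <?php
    $pdo  = new PDO('mysql:host=127.0.0.1;dbname=mydb', 'user', 'pass');
    $size = 10000;   // rows per chunk
    $last = 0;       // highest id seen so far

    do {
        // walk the table in primary-key ranges instead of all at once
        $ids = $pdo->query("SELECT id FROM records WHERE id > $last
                            ORDER BY id LIMIT $size")
                   ->fetchAll(PDO::FETCH_COLUMN);
        foreach ($ids as $id) {
            // process one row id
        }
        if ($ids) {
            $last = (int) end($ids);
        }
    } while (count($ids) === $size);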

A good algorithm can solve your problem anytime. ;-)

-- 
Shiplu Mokaddim
My talks, http://talk.cmyweb.net
Follow me, http://twitter.com/shiplu
SUST Programmers, http://groups.google.com/group/p2psust
Innovation distinguishes bet ... ... (ask Steve Jobs the rest)
