Re: [PHP] memory efficient hash table extension? like lchash ...
PHP does expose Sys V shared-memory APIs (the shm_* functions):

http://us2.php.net/manual/en/book.sem.php

If you already have APC installed, you could also try:

http://us2.php.net/manual/en/book.apc.php

APC also allows you to store user-specific data too (it will be in shared memory).

Haven't tried these myself, so I would do some quick tests to check whether they meet your performance requirements. In theory, it should be faster than Berkeley DB-like solutions (which are also another option, but it seems something similar, like MongoDB, was not good enough?). I am curious to know if someone here has run these tests.

Note that with memcached installed locally (on the same box running PHP), it can be surprisingly efficient - using pconnect(), caching the handle in a static var for a given request cycle, etc.

Ravi

On Sun, Jan 24, 2010 at 9:39 AM, D. Dante Lorenso da...@lorenso.com wrote:
> [snip - full quote of the exchange between Dante and shiplu, which
> appears in its own messages further down this thread]

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
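A minimal sketch of the APC route for the string-to-int mapping discussed in this thread might look like the following. This is an assumption-laden illustration, not tested at the scale in question: it assumes the APC extension is loaded, apc.enable_cli=1 for CLI scripts, and apc.shm_size raised well above the data set; the fasthash_* names are just the hypothetical API from later in the thread.

```php
<?php
// Sketch only: assumes the APC extension is loaded, apc.enable_cli=1
// (for CLI scripts), and apc.shm_size large enough to hold the map.
// The fasthash_* names mirror the hypothetical API from this thread.

function fasthash_set($key, $value)
{
    return apc_store($key, (int) $value); // no TTL: keep until restart/flush
}

function fasthash_get($key)
{
    $value = apc_fetch($key, $success);    // $success disambiguates a
    return $success ? (int) $value : null; // stored falsy value from a miss
}

fasthash_set(md5('some row key'), 12345678);
var_dump(fasthash_get(md5('some row key')));
```

Whether this holds up past a few million keys is exactly the open question in this thread; the slowdown Dante reports below suggests benchmarking before committing to it.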
Re: [PHP] memory efficient hash table extension? like lchash ...
J Ravi Menon wrote:
> PHP does expose Sys V shared-memory APIs (shm_* functions):
> http://us2.php.net/manual/en/book.sem.php

I will look into this. I really need a key/value map, though, and would rather not have to write my own on top of SHM.

> If you already have APC installed, you could also try:
> http://us2.php.net/manual/en/book.apc.php
> APC also allows you to store user-specific data too (it will be in
> shared memory).

I've looked into the apc_store and apc_fetch routines:

http://php.net/manual/en/function.apc-store.php
http://www.php.net/manual/en/function.apc-fetch.php

... but quickly ran out of memory for APC, and though I figured out how to configure it to use more (adjust the shared memory allotment), there were other problems. I ran into issues with logs complaining about cache slamming and other known bugs with APC version 3.1.3p1. Also, after several million values were stored, the APC storage began to slow down *dramatically*. I wasn't certain whether APC was using only RAM or was possibly also writing to disk. Performance tanked so quickly that I set it aside as an option and moved on.

> Haven't tried these myself, so I would do some quick tests to check
> whether they meet your performance requirements. In theory, it should
> be faster than Berkeley DB-like solutions (which are also another
> option, but it seems something similar, like MongoDB, was not good
> enough?).

I will run more tests against MongoDB. Initially I tried to use it to store everything. If I only store my indexes, it might fare better. Certainly, though, running queries and updates against a remote server will always be slower than doing the lookups locally in RAM.

> I am curious to know if someone here has run these tests. Note that
> with memcached installed locally (on the same box running PHP), it can
> be surprisingly efficient - using pconnect(), caching the handle in a
> static var for a given request cycle, etc.

memcached gives no guarantee about data persistence. I need to have a hash table that will contain all the values I set. They don't need to survive a server shutdown (don't need to be written to disk), but I cannot afford for the server to throw away values that don't fit into memory. If there is a way to configure memcached to guarantee storage, that might work.

-- Dante

On Sun, Jan 24, 2010 at 9:39 AM, D. Dante Lorenso da...@lorenso.com wrote:
> [snip - full quote of the earlier exchange trimmed]
Re: [PHP] memory efficient hash table extension? like lchash ...
> Also, after several million values were stored, the APC storage began
> to slow down *dramatically*. I wasn't certain if APC was using only
> RAM or was possibly also writing to disk.

IIRC, I think it is built over shm and there is no disk backing store.

> memcached gives no guarantee about data persistence. I need to have a
> hash table that will contain all the values I set. They don't need to
> survive a server shutdown (don't need to be written to disk), but I
> cannot afford for the server to throw away values that don't fit into
> memory. If there is a way to configure memcached to guarantee storage,
> that might work.

True, but the LRU policy only kicks in lazily. So if you ensure that you never get near the max allowed limit (the -m option), and you store your key/value pairs with no expiry, they will be present until the next restart. So essentially you would have to estimate a value for the -m option big enough to accommodate all possible key/value pairs (the evictions counter in memcached stats should remain 0). BTW, I have seen this implementation behavior in the 1.2.x series, but I am not sure it is necessarily guaranteed in future versions.

Ravi

On Mon, Jan 25, 2010 at 3:49 PM, D. Dante Lorenso da...@lorenso.com wrote:
> [snip - full quote of the previous message trimmed]
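Ravi's advice to watch the evictions counter can be checked from PHP itself. A rough sketch, assuming the pecl memcache extension and a local memcached started with a -m limit sized above the full data set (the stat key names are those reported by memcached's stats command):

```php
<?php
// Sketch: verify memcached has never evicted anything after a full
// load. Assumes the pecl memcache extension and a local memcached
// started with a large enough -m limit, items stored with expire = 0.
$mc = new Memcache();
$mc->pconnect('127.0.0.1', 11211);

$stats = $mc->getStats();
if ($stats === false) {
    die("memcached not reachable\n");
}

// If 'evictions' ever climbs above 0, the -m limit was too small and
// the LRU has silently dropped some of the key/value pairs.
echo "evictions:  ", $stats['evictions'], "\n";
echo "bytes used: ", $stats['bytes'], " of ", $stats['limit_maxbytes'], "\n";
```

Running this once after the initial index build, and again after the daily sync, would confirm whether the no-eviction assumption actually holds on a given box.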
Re: [PHP] memory efficient hash table extension? like lchash ...
shiplu wrote:
> On Sun, Jan 24, 2010 at 3:11 AM, D. Dante Lorenso da...@lorenso.com wrote:
>> All,
>>
>> I'm loading millions of records into a backend PHP CLI script that I
>> need to build a hash index from to optimize key lookups for data that
>> I'm importing into a MySQL database. The problem is that storing this
>> data in a PHP array is not very memory efficient, and my millions of
>> records are consuming about 4-6 GB of RAM.
>
> What are you storing? An array of row objects? In that case, storing
> only the row ids will reduce the memory.

I am querying a MySQL database which contains 40 million records and mapping string columns to numeric ids. You might call it normalizing the data.

Then, I am importing a new 40 million records and comparing the new values to the old values. Where the value matches, I update records, but where they do not match, I insert new records, and finally I go back and delete old records. So, the net result is that I have a database with 40 million records that I need to sync on a daily basis.

> If you are loading full row objects, it will take a lot of memory. But
> if you just load the row id values, it will significantly decrease the
> memory amount.

For what I am trying to do, I just need to map a string value (32 bytes) to a bigint value (8 bytes) in a fast-lookup hash.

> Besides, you can load row ids on a chunk-by-chunk basis. If you have
> 10 million rows to process, load one chunk of rows, process them, then
> load the next chunk. This will significantly reduce memory usage.

When importing the fresh 40 million records, I need to compare each record against 4 different indexes that will map the record to existing other records, or into a group_id that the record also belongs to. My current solution uses a trigger in MySQL that will do the lookups inside MySQL, but this is extremely slow. Pre-loading the MySQL indexes into PHP RAM and processing that way is thousands of times faster. I just need an efficient way to hold my hash tables in PHP RAM. PHP arrays are very fast, but like my original post says, they consume way too much RAM.

> A good algorithm can solve your problem anytime. ;-)

It takes about 5-10 minutes to build my hash indexes in PHP RAM currently, which is made up for by the 10,000x speedup on key lookups that I get later on. I just want to not use the whole 6 GB of RAM to do this. I need an efficient hashing API that supports something like:

  $value = (int) fasthash_get((string) $key);
  $exists = (bool) fasthash_exists((string) $key);
  fasthash_set((string) $key, (int) $value);

Or ... it feels like a memcached API, but where the data is stored locally instead of accessed via a network. So this is how my search led me to what appears to be a dead lchash extension.

-- Dante

--
D. Dante Lorenso
da...@lorenso.com
972-333-4139
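Absent a maintained lchash, one memory-compact pure-PHP sketch of a fasthash-style read-mostly index: pack each (32-hex-char key, bigint value) pair into a fixed 24-byte record inside a single string, sort once, then binary-search lookups. This is not lchash's API, and the fasthash_build/fasthash_get names are invented for illustration; it assumes 64-bit PHP and md5-style hex keys.

```php
<?php
// Sketch: a sorted packed-string index. Each record is 24 bytes:
// 16 bytes of raw md5 key + 8 bytes of big-endian value (two 32-bit
// halves, so it works on PHP builds without 64-bit pack codes).
// Assumes 64-bit PHP ints; names are hypothetical, not lchash's API.

define('FH_REC', 24);

function fasthash_build(array $map)
{
    $records = array();
    foreach ($map as $key => $value) {
        // 32 hex chars -> 16 raw bytes, then the value's two halves
        $records[] = pack('H32NN', $key,
            ($value >> 32) & 0xFFFFFFFF, $value & 0xFFFFFFFF);
    }
    // Fixed-width records with the key as prefix: byte sort == key sort
    sort($records, SORT_STRING);
    return implode('', $records);
}

function fasthash_get($index, $key)
{
    $needle = pack('H32', $key);
    $lo = 0;
    $hi = (int) (strlen($index) / FH_REC) - 1;
    while ($lo <= $hi) {
        $mid = ($lo + $hi) >> 1;
        $cmp = strcmp(substr($index, $mid * FH_REC, 16), $needle);
        if ($cmp === 0) {
            $v = unpack('Nhi/Nlo', substr($index, $mid * FH_REC + 16, 8));
            return ($v['hi'] << 32) | $v['lo'];
        }
        if ($cmp < 0) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return null; // key not present
}

$index = fasthash_build(array(
    md5('alpha') => 42,
    md5('beta')  => 12345678901, // needs 64-bit PHP
));
var_dump(fasthash_get($index, md5('alpha'))); // int(42)
var_dump(fasthash_get($index, md5('nope')));  // NULL
```

The trade-off versus a PHP array is the one described above: roughly 24 bytes per entry instead of hundreds, at the cost of O(log n) lookups and a sort during the build phase.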
[PHP] memory efficient hash table extension? like lchash ...
All,

I'm loading millions of records into a backend PHP CLI script that I need to build a hash index from to optimize key lookups for data that I'm importing into a MySQL database. The problem is that storing this data in a PHP array is not very memory efficient, and my millions of records are consuming about 4-6 GB of RAM.

I have tried using some external key/value storage solutions like MemcacheDB, MongoDB, and straight MySQL, but none of these are fast enough for what I'm trying to do. Then I found the lchash extension for PHP, and it looks like exactly what I want. It's a C-lib hash which is accessed from PHP. Using it would be slightly slower than using straight PHP arrays, but it would be much more memory efficient, since not all data needs to be stored as PHP zvals, etc.

The problem is that the lchash extension can't be installed in my PHP 5.3 build, because "pecl install lchash" fails with a message about an invalid checksum on the README file. Apparently this extension has been neglected and abandoned, and hasn't been updated since 2005.

Is there something like lchash that *is* being maintained? What would you all suggest?

-- Dante
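To make the array overhead concrete, here is a small sketch comparing a million integers held in a native PHP array against the same data packed into one binary string. Exact figures depend heavily on the PHP version and build, so run it locally rather than trusting any particular number.

```php
<?php
// Compare memory for one million integers: native PHP array (one zval
// plus hash-bucket overhead per element) vs a single packed string
// (4 bytes per value). Figures vary by PHP version and build.

$before = memory_get_usage();
$arr = array();
for ($i = 0; $i < 1000000; $i++) {
    $arr[$i] = $i;
}
printf("array:  %.1f MB\n", (memory_get_usage() - $before) / 1048576);

unset($arr);

$before = memory_get_usage();
$packed = '';
for ($i = 0; $i < 1000000; $i++) {
    $packed .= pack('N', $i); // 32-bit big-endian per value
}
printf("packed: %.1f MB\n", (memory_get_usage() - $before) / 1048576);

// Reading a value back is a substr + unpack at a fixed offset:
$v = unpack('N', substr($packed, 4 * 123456, 4));
var_dump($v[1]); // int(123456)
```

The packed form gives up O(1) keyed lookup, which is why it would need a sorted layout plus binary search (or a C-side hash like lchash) to replace the array in this use case.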
Re: [PHP] memory efficient hash table extension? like lchash ...
On Sun, Jan 24, 2010 at 3:11 AM, D. Dante Lorenso da...@lorenso.com wrote:
> All,
>
> I'm loading millions of records into a backend PHP CLI script that I
> need to build a hash index from to optimize key lookups for data that
> I'm importing into a MySQL database. The problem is that storing this
> data in a PHP array is not very memory efficient, and my millions of
> records are consuming about 4-6 GB of RAM.

What are you storing? An array of row objects? In that case, storing only the row ids will reduce the memory. If you are loading full row objects, it will take a lot of memory, but if you just load the row id values, it will significantly decrease the memory amount.

Besides, you can load row ids on a chunk-by-chunk basis. If you have 10 million rows to process, load one chunk of rows, process them, then load the next chunk. This will significantly reduce memory usage.

A good algorithm can solve your problem anytime. ;-)

--
Shiplu Mokaddim
My talks, http://talk.cmyweb.net
Follow me, http://twitter.com/shiplu
SUST Programmers, http://groups.google.com/group/p2psust
Innovation distinguishes bet ... ... (ask Steve Jobs the rest)
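The chunk-by-chunk suggestion could be sketched with keyset pagination, so each pass holds only one chunk of rows in RAM. The table and column names here are hypothetical, and it assumes mysqli plus a numeric primary key to page on:

```php
<?php
// Sketch of chunked loading via keyset pagination. Table name
// ('records'), columns ('id', 'hash_key'), and credentials are all
// hypothetical placeholders; assumes mysqli and a numeric primary key.
$db = new mysqli('localhost', 'user', 'pass', 'mydb');
$chunk  = 100000;
$lastId = 0;

do {
    $res = $db->query(sprintf(
        'SELECT id, hash_key FROM records WHERE id > %d ORDER BY id LIMIT %d',
        $lastId, $chunk
    ));
    $rows = 0;
    while ($row = $res->fetch_assoc()) {
        $lastId = (int) $row['id'];
        $rows++;
        // ... compare/sync this row's mapping here, then let it go ...
    }
    $res->free();
} while ($rows === $chunk); // a short page means we reached the end
```

Paging on `WHERE id > last` rather than `LIMIT offset, n` keeps each query an index range scan instead of forcing MySQL to skip an ever-growing offset, which matters at 40 million rows.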