Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
On Mon, Nov 21, 2011 at 6:14 PM, Tomas Vondra <t...@fuzzy.cz> wrote:

>> An index on (a, b) can be used for queries involving only a but not
>> for those involving only b.
>
> That is not true since 8.2 - a multi-column index may be used even for
> queries without conditions on the leading columns. It won't be as
> effective as with conditions on the leading columns, because the whole
> index must be scanned, but it's usually much cheaper than keeping two
> indexes (memory requirements, overhead when inserting data etc.).
>
> Check this:
> http://www.postgresql.org/docs/8.2/interactive/indexes-multicolumn.html

Thanks Tomas. VERY useful information. I've decided to go with a unique
multicolumn index for now. Will ask the experts here if I see some
issues.

Thanks!

--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
On 21 November 2011, 4:17, David Johnston wrote:
> On Nov 20, 2011, at 20:50, Phoenix Kiula <phoenix.ki...@gmail.com> wrote:
>> On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
>> <gavinflo...@archidevsys.co.nz> wrote:
>>> How about having 2 indexes: one on each of ip & url_md5? Pg will
>>> combine the indexes as required, or will just use one if that is best.
>>
>> Thanks Gavin. Question: what if I have a joined index? If from a
>> joined index I only use the first column (say, ip), will a joined
>> index still be used? It is cleaner to create two indexes for the two
>> columns. Which is recommended?
>
> An index on (a, b) can be used for queries involving only a but not for
> those involving only b.

That is not true since 8.2 - a multi-column index may be used even for
queries without conditions on the leading columns. It won't be as
effective as with conditions on the leading columns, because the whole
index must be scanned, but it's usually much cheaper than keeping two
indexes (memory requirements, overhead when inserting data etc.).

Check this:
http://www.postgresql.org/docs/8.2/interactive/indexes-multicolumn.html

Tomas
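Tomas's point is easy to check with EXPLAIN. A minimal sketch (the table
and index names here are made up for illustration):

```sql
-- Hypothetical table with a two-column index.
CREATE TABLE t (a int, b int);
CREATE INDEX t_a_b_idx ON t (a, b);

-- Condition on the leading column: the index can be searched directly.
EXPLAIN SELECT * FROM t WHERE a = 1;

-- Since 8.2 the planner may still use t_a_b_idx here, but it has to
-- scan the whole index because there is no condition on "a", so it is
-- slower than a dedicated index on b (yet avoids maintaining one).
EXPLAIN SELECT * FROM t WHERE b = 1;
```

Whether the planner actually picks the index in the second case depends
on table statistics and cost settings, so results will vary.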
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
On Nov 20, 2011, at 21:33, Phoenix Kiula <phoenix.ki...@gmail.com> wrote:
> My big table now has about 70 million rows, with the following columns:
>
>  alias       | character varying(35)
>  url         | text
>  modify_date | timestamp without time zone
>  ip          | bigint
>
> For each IP address (user of my application) I want to have a unique
> URL. So I used to have a UNIQUE constraint on (IP, URL). But the index
> based on this became huge, as some URLs are gigantic, so I introduced
> an md5 of the URL:
>
>  url_md5 | varchar(32)

I don't understand this. So you want to allow exactly one URL for an IP
address? Or do you want to allow only one row with the same (IP, URL)
values? Because that's exactly what a UNIQUE index on (IP, URL) does. If
you want to allow just a single URL for an IP, you should create an
index on IP only.

> I now have two scenarios:
>
> 1. To have an index (unique?) on (ip, url_md5)
>
> 2. To have an index on just the ip. This way a query that tries to
> match "...WHERE ip = 999 AND url_md5 = 'md5 here'..." will still look
> only at the ip bit of the index, then refine it with the url_md5.

Have you actually performed any benchmarks with this? Just create the
indexes, run several (say 100) queries, and that should give you an idea
which of the options is better.

We can speculate on the pros/cons of those options, but both are viable
under certain conditions. For example, if there's just a very small
number of URLs for an IP, then #2 is probably going to be better. But if
there's an enormous number of URLs per IP, then a multi-column index is
probably going to perform better.

But those are only guesses - try both options, run a few queries (not on
the overloaded system you're struggling with - that'd skew the results)
and show us EXPLAIN ANALYZE of the queries.

> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will
> allow me to have a bigint column and save a lot hopefully in both
> storage space and speed?
>
> (Some very useful points mentioned here:
> http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer )

You can obviously take an MD5 hash, which is just a 16-byte value, and
use the first 4 bytes as an integer. The problem with this is that it
probably significantly increases the collision rate, i.e. it will
indicate two URLs to be the same although the URLs are different.

> 2. If I do go with the above scenario #1 of a joint index, is there
> any way I can save space and maintain speed? Partitioning etc. are out
> of the question.

What is the index size, anyway? You've mentioned it's huge, but what is
the actual size?

> With a growing web database, I am sure many people face this
> situation. Are NoSQL type databases the only sane solution to such
> massive volumes and throughput expectations (e.g., CouchDB's Membase)?

Well, partitioning is usually a good approach to problems like these.

Have you thought about moving the URLs into a separate table? I.e.
instead of this table

 alias       | character varying(35)
 url         | text
 url_md5     | character varying(32)
 modify_date | timestamp without time zone
 ip          | bigint

you'd have two tables - one for keeping the URLs:

 id      | integer
 url     | text
 url_md5 | character varying(32)

and the current one, referencing the URLs:

 alias       | character varying(35)
 url_id      | integer
 modify_date | timestamp without time zone
 ip          | bigint

That'd allow you to create a UNIQUE index on (ip, url_id), which should
be much smaller than the current one. But handling the inserts would be
significantly more complex (you'd have to check existence of the URL,
insert it etc.).

Tomas
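The two-table layout Tomas describes could be sketched roughly as
follows (table and column names beyond those in the thread are
illustrative, not from the original posts):

```sql
-- Each distinct URL is stored once, keyed by a small surrogate id.
CREATE TABLE urls (
    id      serial PRIMARY KEY,
    url     text NOT NULL,
    url_md5 varchar(32) NOT NULL UNIQUE
);

-- The big table references URLs by id.  The unique index now covers
-- (bigint, integer) instead of (bigint, varchar(32)), so it is much
-- smaller than one that includes the md5 text.
CREATE TABLE aliases (
    alias       varchar(35),
    url_id      integer NOT NULL REFERENCES urls (id),
    modify_date timestamp,
    ip          bigint,
    UNIQUE (ip, url_id)
);
```

As Tomas notes, inserts become a two-step affair: look up (or insert)
the row in urls first, then insert into aliases with the returned id.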
[GENERAL] Table Design question for gurus (without going to NoSQL)...
Hi. Want to start another thread, loosely related to the performance
problems thread I have going. Need some DB design guidance from the
gurus here.

My big table now has about 70 million rows, with the following columns:

 alias       | character varying(35)
 url         | text
 modify_date | timestamp without time zone
 ip          | bigint

For each IP address (user of my application) I want to have a unique
URL. So I used to have a UNIQUE constraint on (IP, URL). But the index
based on this became huge, as some URLs are gigantic, so I introduced an
md5 of the URL:

 url_md5 | varchar(32)

I now have two scenarios:

1. To have an index (unique?) on (ip, url_md5)

2. To have an index on just the ip. This way a query that tries to match
"...WHERE ip = 999 AND url_md5 = 'md5 here'..." will still look only at
the ip bit of the index, then refine it with the url_md5.

The good thing about #2 is that the size of the index remains very
small, with only a bigint field (ip) being indexed. The bad thing about
#2 is that each query of "...WHERE ip = 999 AND url_md5 = 'md5 here'..."
will have to refine the indexed IP. If one IP address has put in a lot
of URLs, then this becomes a bit slow. As is now happening, where I have
users who have over 1 million URLs each!

Questions:

1. Instead of md5, is there any integer hashing algorithm that will
allow me to have a bigint column and save a lot hopefully in both
storage space and speed?

(Some very useful points mentioned here:
http://stackoverflow.com/questions/1422725/represent-md5-hash-as-an-integer )

2. If I do go with the above scenario #1 of a joint index, is there any
way I can save space and maintain speed? Partitioning etc. are out of
the question.

With a growing web database, I am sure many people face this situation.
Are NoSQL type databases the only sane solution to such massive volumes
and throughput expectations (e.g., CouchDB's Membase)?

Many thanks for any ideas or pointers!
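On question 1 above, one way to get a bigint hash without leaving SQL is
to reinterpret the first 16 hex digits (8 bytes) of the md5 as a 64-bit
integer. This is only a sketch: collisions become more likely than with
the full 128-bit hash, so a unique constraint should still compare the
URL itself on conflict.

```sql
-- Take the first 8 bytes (16 hex digits) of the md5 and cast them to a
-- signed 64-bit integer via a bit string.  The result may be negative.
SELECT ('x' || substr(md5(url), 1, 16))::bit(64)::bigint AS url_hash
FROM (VALUES ('http://example.com/')) AS v(url);
```

A bigint column indexed this way is 8 bytes per row instead of the ~36
bytes a varchar(32) md5 costs, which shrinks the index accordingly.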
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
On Sun, Nov 20, 2011 at 9:33 PM, Phoenix Kiula <phoenix.ki...@gmail.com> wrote:
> [...]
> But the index based on this became huge, as some URLs are gigantic, so
> I introduced an md5 of the URL:
>
>  url_md5 | varchar(32)
> [...]
> Many thanks for any ideas or pointers!

I thought of adding a bigserial (serial8) column instead of varchar(32)
for the md5. But postgresql tells me that:

-- ERROR: type "bigserial" does not exist --

Why is this? Why can't I create a column with this type? What's the
current syntax?

Thanks.
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
Partition your table if it is too big.
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
On Nov 20, 2011, at 8:33, Phoenix Kiula <phoenix.ki...@gmail.com> wrote:
> Hi. Want to start another thread, loosely related to the performance
> problems thread I have going. Need some DB design guidance from the
> gurus here.
>
> My big table now has about 70 million rows, with the following columns:
>
>  alias       | character varying(35)
>  url         | text
>  modify_date | timestamp without time zone
>  ip          | bigint

While the schema is useful, you need to describe HOW the data is being
used if you want help finding ways to improve performance.

> For each IP address (user of my application) I want to have a unique
> URL. So I used to have a UNIQUE constraint on IP, URL.

Given a base URL, can you encode an algorithm to generate the
user-specific URL on-demand, and then maybe cache that result in the
application?

> But the index based on this became huge, as some URLs are gigantic, so
> I introduced

What does this mean? Are there any patterns to the URLs that you can
leverage (like, say, grouping them by domain name)? Is there a lot of
overlap between users, so that having a URL table with a bigint PK
would make a difference?

> an md5 of the URL:
>
>  url_md5 | varchar(32)
>
> I now have two scenarios:
>
> 1. To have an index (unique?) on (ip, url_md5)
>
> 2. To have an index on just the ip.
> [...]
> If one IP address has put in a lot of URLs, then this becomes a bit
> slow. As is now happening, where I have users who have over 1 million
> URLs each!

Create an additional partial index on the URL for any IP address with
more than X number of records? Your smallish users then only need to
use the IP index, while the big ones use that PLUS their personal URL
index.

> Questions:
>
> 1. Instead of md5, is there any integer hashing algorithm that will
> allow me to have a bigint column and save a lot hopefully in both
> storage space and speed?
>
> 2. If I do go with the above scenario #1 of a joint index, is there
> any way I can save space and maintain speed? Partitioning etc. are out
> of the question.

If you are going to discount the feature whose implementation solves
this specific problem, then you are basically asking the list to solve
your specific problem and, from my comment above, to do so without
providing sufficient details as to how your application works. Also,
WTF do you mean by "etc."? If you are going to discount something from
consideration, you should be able to exactly specify what it is.
Furthermore, if you ask the question and exclude possible solutions,
you should explain why you cannot use them, so that people will not
propose other solutions that would have the same faults.

> With a growing web database, I am sure many people face this
> situation. Are NoSQL type databases the only sane solution to such
> massive volumes and throughput expectations (e.g., CouchDB's Membase)?

You would implement these before you would partition? There are likely
multiple solutions to your problem but, again, simply giving a table
schema doesn't help in determining which ones are feasible.

> Many thanks for any ideas or pointers!

The only data-ignorant, and thus generally useful, PostgreSQL solution
is table partitioning. Use it. My other questions, while an interesting
thought exercise, need intimate knowledge of the data to even evaluate
whether they make sense.

So, in short: use partitions. If you cannot, provide reasons why, and
then include more details about the application and data so that
meaningful solutions have a chance to be suggested.

David J.
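The partial-index idea above could look something like this (the table
name, the example IP, and the threshold are placeholders, not from the
original post):

```sql
-- Small, cheap index that serves the many low-volume users.
CREATE INDEX mytable_ip_idx ON mytable (ip);

-- Extra index scoped to one known heavy user.  The WHERE clause keeps
-- the index tiny relative to a full (ip, url_md5) index, while queries
-- of the form "WHERE ip = 999 AND url_md5 = ..." can use it directly.
CREATE INDEX mytable_heavy_idx ON mytable (url_md5) WHERE ip = 999;
```

The planner only considers a partial index when the query's WHERE clause
provably implies the index predicate, so the heavy user's IP must appear
as a literal (or a matching parameter) in the query.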
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
On Sunday, November 20, 2011 7:12:59 am Phoenix Kiula wrote:
> I thought of adding a bigserial (serial8) column instead of varchar(32)
> for the md5. But postgresql tells me that:
>
> -- ERROR: type "bigserial" does not exist --
>
> Why is this? Why can't I create a column with this type? What's the
> current syntax?
>
> Thanks.

bigserial is not a type so much as a macro that creates a bigint column
with an attached sequence. Example:

test(5432)aklaver=> \d pk_test
    Table "public.pk_test"
 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer | not null
 fld_1  | text    |
Indexes:
    "pk" PRIMARY KEY, btree (id)

test(5432)aklaver=> ALTER TABLE pk_test ADD COLUMN bg bigserial;
NOTICE:  ALTER TABLE will create implicit sequence "pk_test_bg_seq" for
serial column "pk_test.bg"
ALTER TABLE

test(5432)aklaver=> \d pk_test
    Table "public.pk_test"
 Column |  Type   |                      Modifiers
--------+---------+------------------------------------------------------
 id     | integer | not null
 fld_1  | text    |
 bg     | bigint  | not null default nextval('pk_test_bg_seq'::regclass)
Indexes:
    "pk" PRIMARY KEY, btree (id)

--
Adrian Klaver
adrian.kla...@gmail.com
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
On 21/11/11 02:33, Phoenix Kiula wrote:
> Hi. Want to start another thread, loosely related to the performance
> problems thread I have going. Need some DB design guidance from the
> gurus here.
>
> My big table now has about 70 million rows, with the following columns:
>
>  alias       | character varying(35)
>  url         | text
>  modify_date | timestamp without time zone
>  ip          | bigint
> [...]
> Many thanks for any ideas or pointers!

How about having 2 indexes: one on each of ip & url_md5? Pg will combine
the indexes as required, or will just use one if that is best.

Why don't you have a time zone on your timestamp???

Regards,
Gavin
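Gavin's two-index suggestion can be sketched as follows (the table name
is made up); when both columns appear in the WHERE clause, the planner
can combine the two indexes with a bitmap AND:

```sql
-- Two single-column indexes, as suggested.
CREATE INDEX mytable_ip_idx  ON mytable (ip);
CREATE INDEX mytable_md5_idx ON mytable (url_md5);

-- The plan may show a BitmapAnd over both indexes, or just one index
-- scan if the planner estimates that to be cheaper.
EXPLAIN SELECT * FROM mytable
WHERE ip = 999 AND url_md5 = 'md5 here';
```

Bitmap index combination has been available since PostgreSQL 8.1, so it
applies to the versions discussed in this thread.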
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
<gavinflo...@archidevsys.co.nz> wrote:
> How about having 2 indexes: one on each of ip & url_md5? Pg will
> combine the indexes as required, or will just use one if that is best.

Thanks Gavin. Question: what if I have a joined index? If from a joined
index I only use the first column (say, ip), will a joined index still
be used? It is cleaner to create two indexes for the two columns. Which
is recommended?
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
Hi,

On 21 November 2011 00:33, Phoenix Kiula <phoenix.ki...@gmail.com> wrote:
> Hi. Want to start another thread, loosely related to the performance
> problems thread I have going. Need some DB design guidance from the
> gurus here.
>
> My big table now has about 70 million rows, with the following columns:

You can put different tables/indexes on different disks using
tablespaces. For example, one tablespace for all tables and another one
for all indexes (and change random_page_cost and seq_page_cost
appropriately, i.e. lower random_page_cost). It is a good idea to put
pg_xlog on a separate drive too.

Sometimes Postgres just can't utilise all available resources properly,
and you can get results faster by running a query over multiple
connections. It could be worth investigating pgpool-II's parallel query
mode. You don't need multiple servers - just set up multiple PG
instances on the same physical machine (up to one PG instance per core,
with lower shared_mem, ...).

An alternative could be a parallel DB like Postgres-XC
(http://wiki.postgresql.org/wiki/Postgres-XC), ...

--
Ondrej Ivanic
(ondrej.iva...@gmail.com)
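Splitting tables and indexes across disks with tablespaces, as suggested
above, takes only a few statements. A sketch (the paths and names are
examples; the directories must already exist and be owned by the
postgres OS user):

```sql
-- One tablespace per physical volume.
CREATE TABLESPACE data_space  LOCATION '/mnt/disk1/pgdata';
CREATE TABLESPACE index_space LOCATION '/mnt/disk2/pgindex';

-- Put the table on one disk and its index on the other, so heap and
-- index I/O do not compete for the same spindle.
CREATE TABLE mytable (ip bigint, url_md5 varchar(32))
    TABLESPACE data_space;
CREATE INDEX mytable_ip_idx ON mytable (ip)
    TABLESPACE index_space;
```

Existing tables and indexes can be moved later with ALTER TABLE ... SET
TABLESPACE, though that rewrites the relation and takes a heavy lock.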
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
On Nov 20, 2011, at 20:50, Phoenix Kiula <phoenix.ki...@gmail.com> wrote:
> On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
> <gavinflo...@archidevsys.co.nz> wrote:
>> How about having 2 indexes: one on each of ip & url_md5? Pg will
>> combine the indexes as required, or will just use one if that is best.
>
> Thanks Gavin. Question: what if I have a joined index? If from a
> joined index I only use the first column (say, ip), will a joined
> index still be used? It is cleaner to create two indexes for the two
> columns. Which is recommended?

An index on (a, b) can be used for queries involving only a but not for
those involving only b.

Neither is recommended - both have their benefits and limitations.

David J.
Re: [GENERAL] Table Design question for gurus (without going to NoSQL)...
On 21/11/11 14:50, Phoenix Kiula wrote:
> On Mon, Nov 21, 2011 at 7:26 AM, Gavin Flower
> <gavinflo...@archidevsys.co.nz> wrote:
>> How about having 2 indexes: one on each of ip & url_md5? Pg will
>> combine the indexes as required, or will just use one if that is best.
>
> Thanks Gavin. Question: what if I have a joined index? If from a
> joined index I only use the first column (say, ip), will a joined
> index still be used?

It will use the joined index if the planner thinks it is worth it.
However, it is harder for the planner to justify using the index for
the second field when the query does not restrict on the first field (I
am assuming it can, if required).

> It is cleaner to create two indexes for the two columns. Which is
> recommended?

If you are frequently just using one or other of the indexes, and/or
could not effectively use a joined index in both directions - then
separate indexes will probably be better. If your predominant query
mode can just use the one joined index effectively, then that would be
better. Consider the amount of RAM the indexes and table data will take
up.

The advantages of indexing the 2 fields separately, compared to one
joined index, are that if you only need either single-field index, it
will take up less RAM and also be quicker to read from disk; plus, the
2 single-field indexes can be used together for queries that use both
fields. The costs are that when both indexes need to be used, there is
a little bit more processing involved, and 2 single-field indexes take
up more RAM than a single joined index.

So the answer is 'it depends...'!