[PERFORM] Guidelines on best indexing strategy for varying searches on 20+ columns
Hi,

I’m running a search engine for cars. It’s backed by a PostgreSQL 9.3 installation. Now I’m unsure about the best approach/strategy for doing index optimization for the frontend search.

The problem:

The table containing the cars holds around 1.5 million rows. People who search for cars use different criteria. Some search by brand/model, some by year, some by mileage, some by price and some by special equipment etc. - and often they combine a whole bunch of criteria together. Of course some, like brand/model and price, are used more frequently than others. In total we offer: 9 category criteria like brand/model or body type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria like equipment. Lastly, people can order the results by different columns (year, price, mileage and a score we create about the cars). By default we order by our own generated score.

What I’ve done so far:

I have analyzed the usage of the criteria "lightly", and created a few indexes (10). Among those are e.g. indexes on price, mileage and a combined index on brand/model. Since we are only interested in showing results for cars which are actually for sale, the indexes are made as partial indexes on a sales state column.

Questions:

1. How would you go about analyzing and determining which columns should be indexed, and how?
2. What is the best strategy when optimizing indexes for searches happening on 20+ columns, where the use and the combinations vary a lot? (To just index everything, to index some of the columns, to do combined indexes, to only do single-column indexes, etc.)
3. I expect that it does not make sense to index all columns?
4. I expect it does not make sense to index boolean columns?
5. Is it better to do a combined index on 5 frequently used columns rather than having individual indexes on each of them?
6. Would it be a good idea to have all indexes sorted by my default sorting?
7. Do you have any experience with other approaches that could greatly improve performance (e.g. forcing indexes to stay in memory etc.)?
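(For illustration only: a minimal sketch of the kind of partial indexes described above, assuming a table named "cars" with columns such as sales_state, price, brand and model - these names are placeholders, not the actual schema.)

    -- Partial indexes limited to cars currently for sale (hypothetical names):
    CREATE INDEX cars_price_for_sale_idx
        ON cars (price)
        WHERE sales_state = 'for_sale';

    CREATE INDEX cars_brand_model_for_sale_idx
        ON cars (brand, model)
        WHERE sales_state = 'for_sale';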
[PERFORM] how to improve perf of 131MM row table?
Sorry for the semi-newbie question...

I have a relatively sizable PostgreSQL 9.0.2 DB with a few large tables (keep in mind "large" is relative, I'm sure there are plenty larger out there).

One of my queries that seems to be bogging down performance is a join between two tables on each of their BIGINT PKs (so they have default unique constraint/PK indexes on them). One table is a detail table for the other. The "master" has about 6mm rows. The detail table has about 131mm rows (table size = 17GB, index size = 16GB).

I unfortunately have limited disks, so I can't actually move to multiple spindles, but wonder if there is anything I can do (should I partition the data, etc.) to improve performance? Maybe some further tuning to my .conf, but I do think that's using as much mem as I can spare right now (happy to send it along if it would help).

The DB is vacuumed nightly with stats updates enabled. I can send the statistics info listed in the pgAdmin tab if that would help.

Any suggestions, tips, tricks, links, etc. are welcomed!

Thanks in advance,
AJ
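(As an aside, table and index sizes of the kind quoted above can be checked with the built-in size functions; "detail_table" below is a placeholder name.)

    SELECT pg_size_pretty(pg_relation_size('detail_table'))       AS table_size,
           pg_size_pretty(pg_indexes_size('detail_table'))        AS index_size,
           pg_size_pretty(pg_total_relation_size('detail_table')) AS total_size;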
Re: [PERFORM] how to improve perf of 131MM row table?
On 06/25/2014 03:10 PM, AJ Weber wrote:

> I have a relatively sizable postgresql 9.0.2 DB with a few large tables
> (keep in mind "large" is relative, I'm sure there are plenty larger out
> there).

Regardless of any help we might offer regarding this, you need to upgrade your installation to 9.0.17. You are behind by several performance, security, and integrity bugfixes, some of which address critical corruption bugs related to replication.

> One of my queries that seems to be bogging-down performance is a join
> between two tables on each of their BIGINT PK's (so they have default
> unique constraint/PK indexes on them). One table is a detail table for
> the other.

This isn't enough information. Just knowing the relative sizes of the tables doesn't tell us which columns are indexed, whether or not the query is using those indexes, how many rows usually match, which queries are performing badly, and so on.

Please refer to this page to ask performance related questions:

https://wiki.postgresql.org/wiki/Slow_Query_Questions

Without much of this information, we'd only be speculating.

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com
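(For reference, a minimal sketch of how the requested details can be collected; "master", "detail" and the query are placeholders, not the poster's actual schema.)

    -- table and index definitions, from psql:
    \d master
    \d detail

    -- plan plus runtime statistics for the slow query (the BUFFERS option is
    -- available from 9.0 on):
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT ...;  -- substitute the actual slow query here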
Re: [PERFORM] how to improve perf of 131MM row table?
I will gather the other data tonight. Thank you.

In the meantime, I guess I wasn't clear about some other particulars. The query's where clause is only an "IN", with a list of ids (those I mentioned are the PK), and the join is explicitly on the PK (so, indexed). Thus, there should be only the explicit matches to the IN clause returned, and if PostgreSQL isn't using the unique index on that column, I would be very shocked (to the point I would suggest there is a bug somewhere).

An IN with 50 int values took 23 sec to return (by way of example).

Thanks again.

--
Aaron

On June 25, 2014 4:49:16 PM EDT, Shaun Thomas wrote:
>On 06/25/2014 03:10 PM, AJ Weber wrote:
>
>> I have a relatively sizable postgresql 9.0.2 DB with a few large tables
>> (keep in mind "large" is relative, I'm sure there are plenty larger out
>> there).
>
>Regardless of any help we might offer regarding this, you need to
>upgrade your installation to 9.0.17. You are behind by several
>performance, security, and integrity bugfixes, some of which address
>critical corruption bugs related to replication.
>
>> One of my queries that seems to be bogging-down performance is a join
>> between two tables on each of their BIGINT PK's (so they have default
>> unique constraint/PK indexes on them). One table is a detail table for
>> the other.
>
>This isn't enough information. Just knowing the relative sizes of the
>tables doesn't tell us which columns are indexed, whether or not the
>query is using those indexes, how many rows usually match, which queries
>are performing badly, and so on.
>
>Please refer to this page to ask performance related questions:
>
>https://wiki.postgresql.org/wiki/Slow_Query_Questions
>
>Without much of this information, we'd only be speculating.
>
>--
>Shaun Thomas
>OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
>312-676-8870
>stho...@optionshouse.com
Re: [PERFORM] Guidelines on best indexing strategy for varying searches on 20+ columns
On Wed, Jun 25, 2014 at 3:48 AM, Niels Kristian Schjødt wrote:
> Hi,
> I’m running a search engine for cars. It’s backed by a postgresql 9.3
> installation.
>
> Now I’m unsure about the best approach/strategy for doing index optimization
> for the frontend search.
>
> The problem:
>
> The table containing the cars holds around 1.5 million rows. People who
> search for cars use different criteria to search by. Some search by
> brand/model, some by year, some by mileage, some by price and some by special
> equipment etc. etc. - and often they combine a whole bunch of criteria
> together. Of course some, like brand/model and price, are used more frequently
> than others. In total we offer: 9 category criteria like brand/model or body
> type, plus 5 numeric criteria like price or mileage, plus 12 boolean criteria
> like equipment. Lastly people can order the results by different columns
> (year, price, mileage and a score we create about the cars). By default we
> order by our own generated score.
>
> What I’ve done so far:
>
> I have analyzed the usage of the criteria "lightly", and created a few
> indexes (10). Among those are e.g. indexes on price, mileage and a combined
> index on brand/model. Since we are only interested in showing results for
> cars which are actually for sale, the indexes are made as partial indexes on
> a sales state column.
>
> Questions:
>
> 1. How would you go about analyzing and determining what columns should be
> indexed, and how?

Mainly frequency of access.

> 2. What is the best strategy when optimizing indexes for searches happening
> on 20+ columns, where the use and the combinations vary a lot? (To just
> index everything, to index some of the columns, to do combined indexes, to
> only do single column indexes etc. etc.)

Don't make 20 indexes. Consider installing pg_trgm (for optimized LIKE
searching) or hstore (for optimized key/value searching) and then using
GiST/GIN for multiple attribute search (a small sketch follows at the end of
this message). With 9.4 we have another fancy technique to explore: jsonb
searching via GiST/GIN.

> 3. I expect that it does not make sense to index all columns?

Well, maybe. If you only ever search one column at a time, then it might make
sense. But if you need to search arbitrary criteria and frequently combine a
large number, then no -- particularly if your dataset is very large and
individual criteria are not very selective.

> 4. I expect it does not make sense to index boolean columns?

In general, no. An important exception is if you are only interested in true
or false and the number of records that have that interesting value is tiny
relative to the size of the table. In that case, a partial index can be used
for massive optimization.

> 5. Is it better to do a combined index on 5 frequently used columns rather
> than having individual indexes on each of them?

Only if you search those 5 columns together a significant portion of the time.

> 6. Would it be a good idea to have all indexes sorted by my default sorting?

Index order rarely matters. If you always search values backwards and the
table is very large you may want to consider it. Unfortunately this often
doesn't work for composite indexes, so sometimes we must explore the old
school technique of reversing the value.

> 7. Do you have any experience with other approaches that could greatly
> improve performance (e.g. forcing indexes to stay in memory etc.)?

As noted above, fancy indexing is the first place to look. Start with pg_trgm
(for LIKE optimization), hstore, and the new json stuff.
The big limitation you will hit is that most index strategies, at least for
the prepackaged stuff, will support '=' or partial string matching
(particularly with pg_trgm LIKE), but not > or <: for range operations you
have to post-process the search or try to work the index from another angle.

merlin
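(A minimal sketch of the two techniques mentioned above, assuming a "cars"
table; the column names are made up for illustration and are not the actual
schema.)

    -- trigram GIN index for optimized LIKE/ILIKE searching (requires pg_trgm):
    CREATE EXTENSION pg_trgm;
    CREATE INDEX cars_model_trgm_idx ON cars USING gin (model gin_trgm_ops);

    -- partial index on a boolean, worthwhile only when the interesting value
    -- is rare relative to the size of the table:
    CREATE INDEX cars_has_towbar_idx ON cars (id) WHERE has_towbar;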
Re: [PERFORM] how to improve perf of 131MM row table?
On 06/25/2014 04:40 PM, Aaron Weber wrote:

> In the meantime, I guess I wasn't clear about some other particulars.
> The query's where clause is only an "IN", with a list of id's (those
> I mentioned are the PK), and the join is explicitly on the PK (so,
> indexed).

Indexed doesn't mean indexed if the wrong datatypes are used. We need to see the table and index definitions, and a sample query with EXPLAIN ANALYZE output.

> An IN with 50 int values took 23sec to return (by way of example).

To me, this sounds like a sequence scan, or one of your key matches so many rows, the random seeks are throwing off your performance. Of course, I can't confirm that without EXPLAIN output.

--
Shaun Thomas
OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
stho...@optionshouse.com
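(One hypothetical illustration of the datatype point above: with a bigint key,
numeric literals in the IN list force the column to be cast for the comparison
and can keep the planner off the PK index, which EXPLAIN would reveal. Table
and column names are placeholders.)

    -- should use the bigint PK index (integer literals):
    EXPLAIN ANALYZE SELECT * FROM detail WHERE master_id IN (101, 102, 103);

    -- likely a sequential scan instead: the numeric literals make the
    -- comparison numeric, so master_id is cast row by row and the btree
    -- index can't be used:
    EXPLAIN ANALYZE SELECT * FROM detail WHERE master_id IN (101.0, 102.0, 103.0);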
Re: [PERFORM] how to improve perf of 131MM row table?
Will get what you asked for ASAP. Thanks for your time.

--
Aaron

On June 25, 2014 5:55:29 PM EDT, Shaun Thomas wrote:
>On 06/25/2014 04:40 PM, Aaron Weber wrote:
>
>> In the meantime, I guess I wasn't clear about some other particulars.
>> The query's where clause is only an "IN", with a list of id's (those
>> I mentioned are the PK), and the join is explicitly on the PK (so,
>> indexed).
>
>Indexed doesn't mean indexed if the wrong datatypes are used. We need to
>see the table and index definitions, and a sample query with EXPLAIN
>ANALYZE output.
>
>> An IN with 50 int values took 23sec to return (by way of example).
>
>To me, this sounds like a sequence scan, or one of your key matches so
>many rows, the random seeks are throwing off your performance. Of
>course, I can't confirm that without EXPLAIN output.
>
>--
>Shaun Thomas
>OptionsHouse, LLC | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
>312-676-8870
>stho...@optionshouse.com