Re: [PERFORM] high user cpu, massive SELECTs, no io waiting problem
Thomas Pöhler wrote: I remember you said you were using nginx and php-fastcgi, how many web server boxes do you have, and what are the specs ? -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
[PERFORM] application of KNN code to US zipcode searches?
We perform over 1,000,000 searches each day for adoptable shelter pets near your zipcode. We already have adequate performance for these searches using the cube contrib, but the new KNN work in 9.1 seemed like it might be a promising way to speed this up even further. I installed PostgreSQL 9.1 on my laptop to evaluate it, using this post as a reference: http://www.depesz.com/index.php/2010/12/11/waiting-for-9-1-knngist/ The first task was to translate a geo-spatial search to use the new KNN syntax. I'm most familiar with two approaches to geo-spatial searching with PostgreSQL. The first is the older earthdistance approach, using point types and the @ operator. The second is the one we are using now, which uses a cube type, the cube_distance() and earth_box() method and a GIST index on the cube type. Immediately there is a hurdle in that KNN only appears to work with point types and the - operator, which does simple point-to-point distance, instead of the distance-around-the-earth. Still, I thought that could be enough of an approximation to test the waters. I started with some real world queries that involved some table joins, and when those failed to show improvement, I worked with some reduced-test-case queries. While I could confirm the new GIST index was being used on the point type, I couldn't get a query to benchmark better when it was invoked. I'm wondering if perhaps US zipcode searches aren't good use of this technology, perhaps because the data set is too small ( About 40,000 zipcodes ). Given that we can already do GIST-indexed searches with the cube type that provide good reasonable approximations for zipcode-radius searches, are others planning to eventually apply the KNN work to US zipcode searches? Sample EXPLAIN output and query times are below. Mark EXPLAIN ANALYZE SELECT zipcode, lon_lat - '(-118.412426,34.096629)' AS radius FROM zipcodes ; --- Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.019..84.543 rows=41483 loops=1) Total runtime: 148.129 ms EXPLAIN ANALYZE SELECT zipcode, lon_lat - '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat - '(-118.412426,34.096629)'; -- Index Scan using zipcodes_knn on zipcodes (cost=0.00..5365.93 rows=41483 width=22) (actual time=0.451..141.590 rows=41483 loops=1) Order By: (lon_lat - '(-118.412426,34.096629)'::point) Total runtime: 206.392 ms -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] application of KNN code to US zipcode searches?
Mark Stosberg m...@summersault.com wrote: Sample EXPLAIN output and query times are below. Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.019..84.543 rows=41483 loops=1) Index Scan using zipcodes_knn on zipcodes (cost=0.00..5365.93 rows=41483 width=22) (actual time=0.451..141.590 rows=41483 loops=1) I thought the benefit of KNN was that you could retrieve the rows in distance order, so that a query for the closest 20 locations (for example) would be very fast. I wouldn't have expected it to be helpful when you're selecting all the rows regardless of distance. -Kevin -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] application of KNN code to US zipcode searches?
I thought the benefit of KNN was that you could retrieve the rows in distance order, so that a query for the closest 20 locations (for example) would be very fast. I wouldn't have expected it to be helpful when you're selecting all the rows regardless of distance. Kevin, Thanks for the feedback. You are right that my reduced test case wasn't a good approximation. I added a limit, to simulate finding the 100 zipcodes closest to 90210. Below I compare 4 approaches to the same query: 1. Cube search 2. Earth Distance Search 3. Simple point distance (no index) 4. Simple point distance (KNN) Now KNN benchmarks to be almost 100x faster! That's very promising. Then there's only the issue that simple point distance is not expected to be a good enough approximation of earth-distances. Perhaps that can be solved by pre-computing coordinates based on the lat/long pairs much like the map projections used to present a curved surface on a flat map? Given that's OK to be be a few miles off, it seems we have some leeway here. Recommendations? Mark EXPLAIN ANALYZE SELECT zipcode, cube_distance( '(-2513120.64361786, -4645511.0460328, 3575538.9507084)', zipcodes.earth_coords)/1609.344 AS radius FROM zipcodes ORDER BY radius LIMIT 100; --- Limit (cost=2946.70..2946.95 rows=100 width=62) (actual time=167.650..168.064 rows=100 loops=1) - Sort (cost=2946.70..3050.40 rows=41483 width=62) (actual time=167.644..167.829 rows=100 loops=1) Sort Key: ((cube_distance('(-2513120.64361786, -4645511.0460328, 3575538.9507084)'::cube, earth_coords) / 1609.344::double precision)) Sort Method: top-N heapsort Memory: 20kB - Seq Scan on zipcodes (cost=0.00..1361.24 rows=41483 width=62) (actual time=0.030..90.807 rows=41483 loops=1) Total runtime: 168.300 ms 3 -- Using Earthdistance EXPLAIN ANALYZE SELECT zipcode, lon_lat @ '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat @ '(-118.412426,34.096629)' LIMIT 100; Limit (cost=2842.99..2843.24 rows=100 width=22) (actual time=187.995..188.451 rows=100 loops=1) - Sort (cost=2842.99..2946.70 rows=41483 width=22) (actual time=187.989..188.149 rows=100 loops=1) Sort Key: ((lon_lat @ '(-118.412426,34.096629)'::point)) Sort Method: top-N heapsort Memory: 20kB - Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.033..108.203 rows=41483 loops=1) Total runtime: 188.660 ms ## Using simple point distance, but with no Gist Index: EXPLAIN ANALYZE SELECT zipcode, lon_lat - '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat - '(-118.412426,34.096629)' LIMIT 100; Limit (cost=2842.99..2843.24 rows=100 width=22) (actual time=160.574..161.057 rows=100 loops=1) - Sort (cost=2842.99..2946.70 rows=41483 width=22) (actual time=160.568..160.691 rows=100 loops=1) Sort Key: ((lon_lat - '(-118.412426,34.096629)'::point)) Sort Method: top-N heapsort Memory: 20kB - Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.027..84.610 rows=41483 loops=1) Total runtime: 161.226 ms # -- Using KNN-GIST index EXPLAIN ANALYZE SELECT zipcode, lon_lat - '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat - '(-118.412426,34.096629)' LIMIT 100; -- Limit (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892 rows=100 loops=1) - Index Scan using zipcodes_knn on zipcodes (cost=0.00..5365.93 rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1) Order By: (lon_lat - '(-118.412426,34.096629)'::point) Total runtime: 2.198 ms -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] application of KNN code to US zipcode searches?
* Mark Stosberg (m...@summersault.com) wrote: Recommendations? PostGIS, geometry columns, and UTM.. I'm not sure where they are wrt adding KNN support, but it's something they've been anxious to have for a while, so I expect support will come quickly. Thanks, Stephen signature.asc Description: Digital signature
Re: [PERFORM] application of KNN code to US zipcode searches?
PostGIS, geometry columns, and UTM.. I'm not sure where they are wrt adding KNN support, but it's something they've been anxious to have for a while, so I expect support will come quickly. I've looked into this a little more. One approach seems to be to project the lat/long pairs on to a flat plane using the Albers projection (which would be a one-time calculation), and then the current KNN point/distance calculations could be used. Here's a Perl module that references the Albers projection (although it's not yet clear to me how to use it): http://search.cpan.org/dist/PDL/ And a Wikipedia page on various calculation possibilities: http://en.wikipedia.org/wiki/Geographical_distance#Flat-surface_formulae Further suggestions welcome. Thanks, Mark -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] application of KNN code to US zipcode searches?
On 17.02.2011 17:20, Mark Stosberg wrote: I thought the benefit of KNN was that you could retrieve the rows in distance order, so that a query for the closest 20 locations (for example) would be very fast. I wouldn't have expected it to be helpful when you're selecting all the rows regardless of distance. Kevin, Thanks for the feedback. You are right that my reduced test case wasn't a good approximation. I added a limit, to simulate finding the 100 zipcodes closest to 90210. Below I compare 4 approaches to the same query: 1. Cube search 2. Earth Distance Search 3. Simple point distance (no index) 4. Simple point distance (KNN) Now KNN benchmarks to be almost 100x faster! That's very promising. Then there's only the issue that simple point distance is not expected to be a good enough approximation of earth-distances. Perhaps that can be solved by pre-computing coordinates based on the lat/long pairs much like the map projections used to present a curved surface on a flat map? Given that's OK to be be a few miles off, it seems we have some leeway here. Recommendations? The existing opclasses only support distance-to-a-point, but I believe the KNN gist code is flexible enough that it could be used for distance to the edge of a shape as well. Someone just needs to write the operators and support functions. -- Heikki Linnakangas EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] application of KNN code to US zipcode searches?
I tried again to use KNN for a real-world query, and I was able to get it to add an approximately 6x speed-up vs the cube search or earthdistance methods ( from 300 ms to 50ms ). I had to make some notable changes for the KNN index to be considered. - Of course, I had to switch to using basic point/distance calculation. As previously noted, this still needs more work to confirm the accuracy and get the distance reported in miles. - The query planner didn't like it when the ORDER BY referred to a column value instead of a static value, even when I believe it should know that the column value never changes. See this pseudo-query where we look-up the coordinates for 90210 once: EXPLAIN ANALYZE SELECT pets.pet_id, zipcodes.lon_lat - center.lon_lat AS radius FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS center, pets JOIN shelters USING (shelter_id) JOIN zipcodes USING (zipcode) ORDER BY postal_codes.lon_lat - center.lon_lat limit 1000; This didn't use the KNN index until I changed the center.lon_lat in the ORDER BY to an explicit point value. I'm not sure if that's expected, or something I should take up with -hackers. This could be worked around by doing a initial query to look-up this value, and then feed a static value into this query. That's not ideal, but the combination would still be faster. - I had to drop the part of the WHERE clause which restricted the results to shelters within 50 miles from the target zipcode. However, I could set the LIMIT so high that I could get back enough pets, and then the application could trim out the results. Or, perhaps I could push this query down into a sub-select, and let PostgreSQL do a second pass to throw out some of the results. In any case, with a real-world speed-up of 6x, this looks like it will be worth it to us to continue to investigate. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] application of KNN code to US zipcode searches?
Heikki Linnakangas heikki.linnakan...@enterprisedb.com writes: The existing opclasses only support distance-to-a-point, but I believe the KNN gist code is flexible enough that it could be used for distance to the edge of a shape as well. Someone just needs to write the operators and support functions. The distance has to be exactly computable from the index entry, so you'd need to store the whole shape in the index, not just a bounding box. Not sure how practical that will be for complex shapes. regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] application of KNN code to US zipcode searches?
Mark Stosberg m...@summersault.com writes: - The query planner didn't like it when the ORDER BY referred to a column value instead of a static value, even when I believe it should know that the column value never changes. See this pseudo-query where we look-up the coordinates for 90210 once: EXPLAIN ANALYZE SELECT pets.pet_id, zipcodes.lon_lat - center.lon_lat AS radius FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS center, pets JOIN shelters USING (shelter_id) JOIN zipcodes USING (zipcode) ORDER BY postal_codes.lon_lat - center.lon_lat limit 1000; As phrased, that's a join condition, so there's no way that an index on a single table can possibly satisfy it. You could probably convert it to a sub-select though: ORDER BY postal_codes.lon_lat - (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') limit 1000; regards, tom lane -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] application of KNN code to US zipcode searches?
Mark, we investigating pgsphere http://pgsphere.projects.postgresql.org/, if we could add KNN support. Oleg On Thu, 17 Feb 2011, Mark Stosberg wrote: I thought the benefit of KNN was that you could retrieve the rows in distance order, so that a query for the closest 20 locations (for example) would be very fast. I wouldn't have expected it to be helpful when you're selecting all the rows regardless of distance. Kevin, Thanks for the feedback. You are right that my reduced test case wasn't a good approximation. I added a limit, to simulate finding the 100 zipcodes closest to 90210. Below I compare 4 approaches to the same query: 1. Cube search 2. Earth Distance Search 3. Simple point distance (no index) 4. Simple point distance (KNN) Now KNN benchmarks to be almost 100x faster! That's very promising. Then there's only the issue that simple point distance is not expected to be a good enough approximation of earth-distances. Perhaps that can be solved by pre-computing coordinates based on the lat/long pairs much like the map projections used to present a curved surface on a flat map? Given that's OK to be be a few miles off, it seems we have some leeway here. Recommendations? Mark EXPLAIN ANALYZE SELECT zipcode, cube_distance( '(-2513120.64361786, -4645511.0460328, 3575538.9507084)', zipcodes.earth_coords)/1609.344 AS radius FROM zipcodes ORDER BY radius LIMIT 100; --- Limit (cost=2946.70..2946.95 rows=100 width=62) (actual time=167.650..168.064 rows=100 loops=1) - Sort (cost=2946.70..3050.40 rows=41483 width=62) (actual time=167.644..167.829 rows=100 loops=1) Sort Key: ((cube_distance('(-2513120.64361786, -4645511.0460328, 3575538.9507084)'::cube, earth_coords) / 1609.344::double precision)) Sort Method: top-N heapsort Memory: 20kB - Seq Scan on zipcodes (cost=0.00..1361.24 rows=41483 width=62) (actual time=0.030..90.807 rows=41483 loops=1) Total runtime: 168.300 ms 3 -- Using Earthdistance EXPLAIN ANALYZE SELECT zipcode, lon_lat @ '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat @ '(-118.412426,34.096629)' LIMIT 100; Limit (cost=2842.99..2843.24 rows=100 width=22) (actual time=187.995..188.451 rows=100 loops=1) - Sort (cost=2842.99..2946.70 rows=41483 width=22) (actual time=187.989..188.149 rows=100 loops=1) Sort Key: ((lon_lat @ '(-118.412426,34.096629)'::point)) Sort Method: top-N heapsort Memory: 20kB - Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.033..108.203 rows=41483 loops=1) Total runtime: 188.660 ms ## Using simple point distance, but with no Gist Index: EXPLAIN ANALYZE SELECT zipcode, lon_lat - '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat - '(-118.412426,34.096629)' LIMIT 100; Limit (cost=2842.99..2843.24 rows=100 width=22) (actual time=160.574..161.057 rows=100 loops=1) - Sort (cost=2842.99..2946.70 rows=41483 width=22) (actual time=160.568..160.691 rows=100 loops=1) Sort Key: ((lon_lat - '(-118.412426,34.096629)'::point)) Sort Method: top-N heapsort Memory: 20kB - Seq Scan on zipcodes (cost=0.00..1257.54 rows=41483 width=22) (actual time=0.027..84.610 rows=41483 loops=1) Total runtime: 161.226 ms # -- Using KNN-GIST index EXPLAIN ANALYZE SELECT zipcode, lon_lat - '(-118.412426,34.096629)' As radius FROM zipcodes ORDER BY lon_lat - '(-118.412426,34.096629)' LIMIT 100; -- Limit (cost=0.00..12.94 rows=100 width=22) (actual time=0.447..1.892 rows=100 loops=1) - Index Scan using zipcodes_knn on zipcodes (cost=0.00..5365.93 rows=41483 width=22) (actual time=0.440..1.407 rows=100 loops=1) Order By: (lon_lat - '(-118.412426,34.096629)'::point) Total runtime: 2.198 ms Regards, Oleg _ Oleg Bartunov, Research Scientist, Head of AstroNet (www.astronet.ru), Sternberg Astronomical Institute, Moscow University, Russia Internet: o...@sai.msu.su, http://www.sai.msu.su/~megera/ phone: +007(495)939-16-83, +007(495)939-23-83 -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] application of KNN code to US zipcode searches?
On 02/17/2011 03:17 PM, Oleg Bartunov wrote: Mark, we investigating pgsphere http://pgsphere.projects.postgresql.org/, if we could add KNN support. Great, thanks Oleg. I'll be happy to test it when something is ready. Mark -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] application of KNN code to US zipcode searches?
On Thu, Feb 17, 2011 at 11:17 AM, Tom Lane t...@sss.pgh.pa.us wrote: Mark Stosberg m...@summersault.com writes: - The query planner didn't like it when the ORDER BY referred to a column value instead of a static value, even when I believe it should know that the column value never changes. See this pseudo-query where we look-up the coordinates for 90210 once: EXPLAIN ANALYZE SELECT pets.pet_id, zipcodes.lon_lat - center.lon_lat AS radius FROM (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') AS center, pets JOIN shelters USING (shelter_id) JOIN zipcodes USING (zipcode) ORDER BY postal_codes.lon_lat - center.lon_lat limit 1000; As phrased, that's a join condition, so there's no way that an index on a single table can possibly satisfy it. You could probably convert it to a sub-select though: ORDER BY postal_codes.lon_lat - (SELECT lon_lat FROM zipcodes WHERE zipcode = '90210') limit 1000; regards, tom lane Would pushing that subquery to a WITH clause be helpful at all? -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] high user cpu, massive SELECTs, no io waiting problem
I think adding UNION ALL SELECT 'postgres version', version(); might be a good thing. On Wed, Feb 16, 2011 at 9:55 AM, Greg Smith g...@2ndquadrant.com wrote: Kevin Grittner wrote: In fact, I wonder whether we shouldn't leave a couple items you've excluded, since they are sometimes germane to problems posted, like lc_collate and TimeZone. I pulled some of them out only because they're not really postgresql.conf settings; lc_collate and lc_ctype for example are set at initdb time. Feel free to hack on that example if you feel it could be improved, just be aware which of those things are not really in the main config file when pondering if they should be included. -- Greg Smith 2ndQuadrant US g...@2ndquadrant.com Baltimore, MD PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.us PostgreSQL 9.0 High Performance: http://www.2ndQuadrant.com/books -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] high user cpu, massive SELECTs, no io waiting problem
Scott, are you really moving that much data through memory, 70-80GB/sec is the limit of the new intel 7500 series in a best case scenario. - John -Original Message- From: pgsql-performance-ow...@postgresql.org [mailto:pgsql-performance-ow...@postgresql.org] On Behalf Of Scott Marlowe Sent: 16 February 2011 15:43 To: Marti Raudsepp Cc: Thomas Pöhler; pgsql-performance@postgresql.org; Felix Feinhals; Verteiler_A-Team; Björn Metzdorf Subject: Re: [PERFORM] high user cpu, massive SELECTs, no io waiting problem On Wed, Feb 16, 2011 at 6:44 AM, Marti Raudsepp ma...@juffo.org wrote: On Tue, Feb 15, 2011 at 20:01, Scott Marlowe scott.marl...@gmail.com wrote: run htop and look for red. if youi've got lots of red bar on each CPU but no io wait then it's waiting for memory access. I don't think this is true. AFAICT the red bar refers to system time, time that's spent in the kernel -- either in syscalls or kernel background threads. My point being that if you've got a lot of RED it'll be the OS waiting for memory access. Trust me, when we start to hit our memory bandwidth (in the 70 to 80 GB/s range) we start to get more and more red and more and more kernel wait time. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance This communication is for informational purposes only. It is not intended as an offer or solicitation for the purchase or sale of any financial instrument or as an official confirmation of any transaction. All market prices, data and other information are not warranted as to completeness or accuracy and are subject to change without notice. Any comments or statements made herein do not necessarily reflect those of JPMorgan Chase Co., its subsidiaries and affiliates. This transmission may contain information that is privileged, confidential, legally privileged, and/or exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any disclosure, copying, distribution, or use of the information contained herein (including any reliance thereon) is STRICTLY PROHIBITED. Although this transmission and any attachments are believed to be free of any virus or other defect that might affect any computer system into which it is received and opened, it is the responsibility of the recipient to ensure that it is virus free and no responsibility is accepted by JPMorgan Chase Co., its subsidiaries and affiliates, as applicable, for any loss or damage arising in any way from its use. If you received this transmission in error, please immediately contact the sender and destroy the material in its entirety, whether in electronic or hard copy format. Thank you. Please refer to http://www.jpmorgan.com/pages/disclosures for disclosures relating to European legal entities. -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance