Re: [PERFORM] Slow query with 3 table joins
This looks like the same optimizer problem that occasionally plagues our customers. Whenever the estimated rows of a join==1, but the actual rows is higher, the optimizer may choose very poor plans. I made some attempts to fix. The very simple fix is to never estimate 1 for a join result. Even using 2 works remarkably well as a defense against this problem. https://github.com/labkey-matthewb/postgres/commit/b1fd99f4deffbbf3db2172ccaba51a34f18d1b1a I also made a much more correct but complicated patch to track both uniqueness and selectivity thought the optimizer, but I didn't quite push that over the finish line (I made a mistake in the hash join code, and got distracted by my day job before finishing it). https://github.com/labkey-matthewb/postgres/commits/struct_selectivity The second path is certainly better approach, but needs someone to pick up the mission. Matt On Wed, Apr 26, 2017 at 8:00 AM, Gerardo Herzigwrote: > Some other approaches you could try: > > 1) What about an hashed index? You could make > CREATE INDEX ON FIELD (unit_id, hashtext(field_name)) > > and changing your query accordingly > > "where hashtext(FIELD.FIELD_NAME)=hashtext('SHEETS_PRESENT') " > > 2) Partitioning (not native yet, but can be simulated through > inheritance), like in > https://www.postgresql.org/docs/current/static/ddl-partitioning.html > This could work well if you have a sort of limited different values in > FIELD.FIELD_NAME > > Gerardo > > - Mensaje original - > > De: "Alessandro Ferrucci" > > Para: pgsql-performance@postgresql.org > > Enviados: Miércoles, 26 de Abril 2017 0:19:37 > > Asunto: Re: [PERFORM] Slow query with 3 table joins > > > > > > > > After about 40 inutes the slow query finally finished and the result > > of the EXPLAIN plan can be found here: > > > > > > https://explain.depesz.com/s/BX22 > > > > > > Thanks, > > Alessandro Ferrucci > > > > > > On Tue, Apr 25, 2017 at 11:10 PM, Alessandro Ferrucci < > > alessandroferru...@gmail.com > wrote: > > > > > > > > > > Hello - I am migrating a current system to PostgreSQL and I am having > > an issue with a relatively straightforward query being extremely > > slow. > > > > > > The following are the definitions of the tables: > > > > > > CREATE TABLE popt_2017.unit > > ( > > id serial NOT NULL, > > unit_id text, > > batch_id text, > > create_date timestamp without time zone DEFAULT now(), > > update_date timestamp without time zone, > > CONSTRAINT unit_pkey PRIMARY KEY (id) > > ) > > WITH ( > > OIDS=FALSE > > ); > > > > > > CREATE TABLE popt_2017.field > > ( > > id serial NOT NULL, > > unit_id integer, > > subunit_data_id integer, > > field_name character varying(50), > > page_id character varying(20), > > page_type character varying(20), > > batch_id character varying(20), > > file_name character varying(20), > > data_concept integer, > > "GROUP" integer, > > omr_group integer, > > pres integer, > > reg_data text, > > ocr_conf text, > > ocr_dict text, > > ocr_phon text, > > create_date timestamp without time zone DEFAULT now(), > > update_date timestamp without time zone, > > CONSTRAINT field_pkey PRIMARY KEY (id), > > CONSTRAINT field_subunit_data_id_fkey FOREIGN KEY (subunit_data_id) > > REFERENCES popt_2017.subunit (id) MATCH SIMPLE > > ON UPDATE NO ACTION ON DELETE NO ACTION, > > CONSTRAINT field_unit_id_fk FOREIGN KEY (unit_id) > > REFERENCES popt_2017.unit (id) MATCH FULL > > ON UPDATE NO ACTION ON DELETE NO ACTION, > > CONSTRAINT field_unit_id_fkey FOREIGN KEY (unit_id) > > REFERENCES popt_2017.unit (id) MATCH SIMPLE > > ON UPDATE NO ACTION ON DELETE NO ACTION > > ) > > WITH ( > > OIDS=FALSE > > ); > > > > > > CREATE TABLE popt_2017.answer > > ( > > id serial NOT NULL, > > field_id integer, > > ans_status integer, > > ans text, > > luggage text, > > arec text, > > kfi_partition integer, > > final boolean, > > length integer, > > create_date timestamp without time zone DEFAULT now(), > > update_date timestamp without time zone, > > CONSTRAINT answer_pkey PRIMARY KEY (id), > > CONSTRAINT answer_field_id_fk FOREIGN KEY (field_id) > > REFERENCES popt_2017.field (id) MATCH FULL > > ON UPDATE NO ACTION ON DELETE NO ACTION, > > CONSTRAINT answer_field_id_fkey FOREIGN KEY (field_id) > > REFERENCES popt_2017.field (id) MATCH SIMPLE > > ON UPDATE NO ACTION ON DELETE NO ACTION > > ) > > WITH ( > > OIDS=FALSE > > ); > > > > > > Below are the index definitions for those tables: > > > > > > UNIT: > > CREATE UNIQUE INDEX unit_pkey ON unit USING btree (id); > > CREATE INDEX unit_unit_id_idx ON unit USING btree (unit_id); > > > > > > FIELD: > > CREATE UNIQUE INDEX field_pkey ON field USING btree (id) > > CREATE INDEX field_unit_id_idx ON field USING btree (unit_id) > > CREATE INDEX field_subunit_id_idx ON field USING btree > > (subunit_data_id) > > CREATE INDEX field_field_name_idx ON field USING btree (field_name) > > > > > > ANSWER: > > CREATE UNIQUE INDEX answer_pkey ON answer
Re: [PERFORM] Slow query with 3 table joins
Dave - I had re-ran ANALYZE and VACUUM on all the tables and that did not seem to have fixed the issue (the query still took a long time, however I did not let it finish to produce a full EXPLAIN plan. However - after creating an index on FIELD(FIELD_NAME,UNIT_ID) and now the query runs very fast ( I changed the FIELD_NAME clause so I would not run into any caching ). The new query is (notice the new FIELD_NAME value to avoid caching). EXPLAIN (ANALYZE,BUFFERS) SELECT UNIT.ID AS UNIT_ID, UNIT.UNIT_ID AS UNIT_UNIT_ID, UNIT.BATCH_ID AS UNIT_BATCH_ID, UNIT.CREATE_DATE AS UNIT_CREATE_DATE, UNIT.UPDATE_DATE AS UNIT_UPDATE_DATE FROM UNIT, FIELD, ANSWER WHERE UNIT.ID=FIELD.UNIT_ID AND FIELD.ID=ANSWER.FIELD_ID AND FIELD.FIELD_NAME='RESP_PH_PREFIX_ID' AND ANSWER.ANS='2'; You can find the EXPLAIN plan here: https://explain.depesz.com/s/apYR I believe this fixes the issue as far as I can see. I'm going to play around with it more and see how it goes. I wanted to also answer your question as to how many different values there are for the FIELD_NAME (and how many rows of each value there are) here is it: SELECT FIELD_NAME,COUNT(*) FROM FIELD GROUP BY FIELD_NAME ORDER BY COUNT; "PAGE_SERIAL";10315 "SHEETS_PRESENT";10315 "RESP_PH_AREA_ID";10556 "RESP_PH_PREFIX_ID";10559 "RESP_PH_SUFFIX_ID";10560 "H_LOC_ADD_NO_IND";10587 "H_TENURE_RENTED_IND";11162 "H_TENURE_OWNED_MORT_IND";11199 "H_TENURE_OWNED_FREE_IND";11208 "PAPER_JIC_2_TEXT";11413 "PAPER_JIC_1_TEXT";11413 "H_TENURE_OCC_NOPAY_IND";11478 "H_LOC_CHILDREN_IND";11496 "H_LOC_RELATIVES_IND";11496 "H_LOC_TEMPORARY_IND";11500 "H_LOC_NONRELATIVES_IND";11510 "PSEUDO_FIELD_MARGINALIA";87744 "H_SIZE_STATED_INT";207918 "P_REL_NO_IND";825240 "P_REL_YES_IND";825240 "P_REL_OTHER_NONREL_IND";1239894 "P_REL_CHILD_ADOPTED_IND";1239894 "P_REL_CHILD_BIO_IND";1239894 "P_REL_CHILD_FOSTER_IND";1239894 "P_REL_CHILD_STEP_IND";1239894 "P_REL_GRANDCHILD_IND";1239894 "P_REL_HOUSEROOMMATE_IND";1239894 "P_REL_INLAW_CHILD_IND";1239894 "P_REL_INLAW_PARENT_IND";1239894 "P_REL_OTHER_REL_IND";1239894 "P_REL_PARENT_IND";1239894 "P_REL_PARTNER_OPP_IND";1239894 "P_REL_PARTNER_SAME_IND";1239894 "P_REL_SIBLING_IND";1239894 "P_REL_SPOUSE_OPP_IND";1239894 "P_REL_SPOUSE_SAME_IND";1239894 "P_TRBSHR_CORP_NAME";1446204 "P_TRBSHR_YES_IND";1446204 "P_TRBSHR_NO_IND";1446204 "P_LOC_ELSE_COLLEGE_IND";1446204 "P_LOC_ELSE_JAIL_IND";1446204 "P_LOC_ELSE_JOB_IND";1446204 "P_LOC_ELSE_MILITARY_IND";1446204 "P_LOC_ELSE_NO_IND";1446204 "P_TRBENR_YES_IND";1446204 "P_LOC_ELSE_SEASONAL_IND";1446204 "P_LOC_ELSE_NURSINGHOME_IND";1446204 "P_TRBENR_TRIBE_NAME";1446204 "P_TRBENR_NO_IND";1446204 "P_LOC_ELSE_RELATIVES_IND";1446204 "P_LOC_ELSE_OTHER_IND";1446204 "P_RACE_WHITE_IND";1447812 "P_RACE2_TONGAN_IND";1447812 "P_RACE2_AFAM_IND";1447812 "P_RACE2_AIAN_TEXT";1447812 "P_RACE2_ASIANINDIAN_IND";1447812 "P_RACE2_ASIAN_TEXT";1447812 "P_RACE2_BLACK_TEXT";1447812 "P_RACE2_CHAMORRO_IND";1447812 "P_RACE2_CHINESE_IND";1447812 "P_RACE2_COLOMBIAN_IND";1447812 "P_RACE2_CUBAN_IND";1447812 "P_RACE2_DOMINICAN_IND";1447812 "P_RACE2_EGYPTIAN_IND";1447812 "P_RACE2_ENGLISH_IND";1447812 "P_RACE2_ETHIOPIAN_IND";1447812 "P_RACE2_FIJIAN_IND";1447812 "P_RACE2_FILIPINO_IND";1447812 "P_RACE2_FRENCH_IND";1447812 "P_RACE2_GERMAN_IND";1447812 "P_RACE2_HAITIAN_IND";1447812 "P_RACE2_HISP_TEXT";1447812 "P_RACE2_IRANIAN_IND";1447812 "P_RACE2_IRISH_IND";1447812 "P_RACE2_ISRAELI_IND";1447812 "P_RACE2_ITALIAN_IND";1447812 "P_RACE2_JAMAICAN_IND";1447812 "P_RACE2_JAPANESE_IND";1447812 "P_RACE2_KOREAN_IND";1447812 "P_RACE2_LEBANESE_IND";1447812 "P_RACE2_MARSHALLESE_IND";1447812 "P_RACE2_MENA_TEXT";1447812 "P_RACE2_MEXICAN_IND";1447812 "P_RACE2_MOROCCAN_IND";1447812 "P_RACE2_NATHAWAIIAN_IND";1447812 "P_RACE2_NHPI_TEXT";1447812 "P_RACE2_NIGERIAN_IND";1447812 "P_RACE2_POLISH_IND";1447812 "P_RACE2_PUERTORICAN_IND";1447812 "P_RACE2_SALVADORAN_IND";1447812 "P_RACE2_SAMOAN_IND";1447812 "P_RACE2_SOMALI_IND";1447812 "P_RACE2_SOR_TEXT";1447812 "P_RACE2_SYRIAN_IND";1447812 "P_RACE2_VIETNAMESE_IND";1447812 "P_RACE2_WHITE_TEXT";1447812 "P_RACE_AIAN_IND";1447812 "P_RACE_ASIAN_IND";1447812 "P_RACE_BLACK_IND";1447812 "P_RACE_HISP_IND";1447812 "P_RACE_MENA_IND";1447812 "P_RACE_NHPI_IND";1447812 "P_RACE_SOR_IND";1447812 "P_SEX_MALE_IND";2273052 "P_SEX_FEMALE_IND";2273052 "P_MIDDLE_NAME";2273052 "P_LAST_NAME";2273052 "P_FIRST_NAME";2273052 "P_BIRTH_YEAR_INT";2273052 "P_BIRTH_MONTH_INT";2273052 "P_BIRTH_DAY_INT";2273052 "P_AGE_INT";2273052 I want to give a HUGE thanks to everyone who took the time to look at my issue and provide insight and assistance, you folks are truly awesome! On Wed, Apr 26, 2017 at 12:12 AM, David Rowleywrote: > On 26 April 2017 at 15:19, Alessandro Ferrucci > wrote: > > After about 40 inutes the slow query finally finished and the result of > the > > EXPLAIN plan can be found here: > > > > https://explain.depesz.com/s/BX22 > > > Index Scan using
Re: [PERFORM] Slow query with 3 table joins
Some other approaches you could try: 1) What about an hashed index? You could make CREATE INDEX ON FIELD (unit_id, hashtext(field_name)) and changing your query accordingly "where hashtext(FIELD.FIELD_NAME)=hashtext('SHEETS_PRESENT') " 2) Partitioning (not native yet, but can be simulated through inheritance), like in https://www.postgresql.org/docs/current/static/ddl-partitioning.html This could work well if you have a sort of limited different values in FIELD.FIELD_NAME Gerardo - Mensaje original - > De: "Alessandro Ferrucci"> Para: pgsql-performance@postgresql.org > Enviados: Miércoles, 26 de Abril 2017 0:19:37 > Asunto: Re: [PERFORM] Slow query with 3 table joins > > > > After about 40 inutes the slow query finally finished and the result > of the EXPLAIN plan can be found here: > > > https://explain.depesz.com/s/BX22 > > > Thanks, > Alessandro Ferrucci > > > On Tue, Apr 25, 2017 at 11:10 PM, Alessandro Ferrucci < > alessandroferru...@gmail.com > wrote: > > > > > Hello - I am migrating a current system to PostgreSQL and I am having > an issue with a relatively straightforward query being extremely > slow. > > > The following are the definitions of the tables: > > > CREATE TABLE popt_2017.unit > ( > id serial NOT NULL, > unit_id text, > batch_id text, > create_date timestamp without time zone DEFAULT now(), > update_date timestamp without time zone, > CONSTRAINT unit_pkey PRIMARY KEY (id) > ) > WITH ( > OIDS=FALSE > ); > > > CREATE TABLE popt_2017.field > ( > id serial NOT NULL, > unit_id integer, > subunit_data_id integer, > field_name character varying(50), > page_id character varying(20), > page_type character varying(20), > batch_id character varying(20), > file_name character varying(20), > data_concept integer, > "GROUP" integer, > omr_group integer, > pres integer, > reg_data text, > ocr_conf text, > ocr_dict text, > ocr_phon text, > create_date timestamp without time zone DEFAULT now(), > update_date timestamp without time zone, > CONSTRAINT field_pkey PRIMARY KEY (id), > CONSTRAINT field_subunit_data_id_fkey FOREIGN KEY (subunit_data_id) > REFERENCES popt_2017.subunit (id) MATCH SIMPLE > ON UPDATE NO ACTION ON DELETE NO ACTION, > CONSTRAINT field_unit_id_fk FOREIGN KEY (unit_id) > REFERENCES popt_2017.unit (id) MATCH FULL > ON UPDATE NO ACTION ON DELETE NO ACTION, > CONSTRAINT field_unit_id_fkey FOREIGN KEY (unit_id) > REFERENCES popt_2017.unit (id) MATCH SIMPLE > ON UPDATE NO ACTION ON DELETE NO ACTION > ) > WITH ( > OIDS=FALSE > ); > > > CREATE TABLE popt_2017.answer > ( > id serial NOT NULL, > field_id integer, > ans_status integer, > ans text, > luggage text, > arec text, > kfi_partition integer, > final boolean, > length integer, > create_date timestamp without time zone DEFAULT now(), > update_date timestamp without time zone, > CONSTRAINT answer_pkey PRIMARY KEY (id), > CONSTRAINT answer_field_id_fk FOREIGN KEY (field_id) > REFERENCES popt_2017.field (id) MATCH FULL > ON UPDATE NO ACTION ON DELETE NO ACTION, > CONSTRAINT answer_field_id_fkey FOREIGN KEY (field_id) > REFERENCES popt_2017.field (id) MATCH SIMPLE > ON UPDATE NO ACTION ON DELETE NO ACTION > ) > WITH ( > OIDS=FALSE > ); > > > Below are the index definitions for those tables: > > > UNIT: > CREATE UNIQUE INDEX unit_pkey ON unit USING btree (id); > CREATE INDEX unit_unit_id_idx ON unit USING btree (unit_id); > > > FIELD: > CREATE UNIQUE INDEX field_pkey ON field USING btree (id) > CREATE INDEX field_unit_id_idx ON field USING btree (unit_id) > CREATE INDEX field_subunit_id_idx ON field USING btree > (subunit_data_id) > CREATE INDEX field_field_name_idx ON field USING btree (field_name) > > > ANSWER: > CREATE UNIQUE INDEX answer_pkey ON answer USING btree (id) > CREATE INDEX answer_field_id_idx ON answer USING btree (field_id) > CREATE INDEX answer_ans_idx ON answer USING btree (ans) > > > The tables each have the following number of rows: > > > UNIT: 10,315 > FIELD: 139,397,965 > ANSWER: 3,463,300 > > > The query in question is: > > > SELECT > UNIT.ID AS UNIT_ID, > UNIT.UNIT_ID AS UNIT_UNIT_ID, > UNIT.BATCH_ID AS UNIT_BATCH_ID, > UNIT.CREATE_DATE AS UNIT_CREATE_DATE, > UNIT.UPDATE_DATE AS UNIT_UPDATE_DATE > FROM > UNIT, FIELD, ANSWER > WHERE > UNIT.ID =FIELD.UNIT_ID AND > FIELD.ID =ANSWER.FIELD_ID AND > FIELD.FIELD_NAME='SHEETS_PRESENT' AND > ANSWER.ANS='2'; > > > I attempted to run an EXPLAIN (ANALYZE,BUFFERS) and the query has > been running for 32 minutes now, So I won't be able to post the > results (as I've never been able to get the query to actually > finish. > > > But, if I remove the join to UNIT (and just join FIELD and ANSWER) > the resulting query is sufficiently fast, (the first time it ran in > roughly 3 seconds), the query as such is: > > > SELECT * FROM > ANSWER, FIELD > WHERE > FIELD.ID =ANSWER.FIELD_ID AND > FIELD.FIELD_NAME='SHEETS_PRESENT' AND > ANSWER.ANS='2'; > > > The EXPLAIN ( ANALYZE, BUFFERS )
Re: [PERFORM] Slow query with 3 table joins
- Mensaje original - > De: "Alessandro Ferrucci"> Para: pgsql-performance@postgresql.org > Enviados: Miércoles, 26 de Abril 2017 0:19:37 > Asunto: Re: [PERFORM] Slow query with 3 table joins > > > > After about 40 inutes the slow query finally finished and the result > of the EXPLAIN plan can be found here: > > > https://explain.depesz.com/s/BX22 > > > Thanks, > Alessandro Ferrucci 1) Looking at the "Rows removed by filter" in that explain, looks like a selectivity issue: Many (many many) rows are fetched, just to be rejected later. I think you can try a partial index on ''field (unit_id) where field_name="SHEETS_PRESENT"'', if it is practical to you. See https://www.postgresql.org/docs/current/static/indexes-partial.html for a good read about partial indexes. 2) 9.2 is a pretty old version of PG. If you are migrating yet, you should consider a more recent version HTH Gerardo -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Slow query with 3 table joins
> Hi Eskil - > > > The I believe the id-field you're referring to is the UNIT.UNIT_ID, I > could change this to a varchar, however that column is not used in the > query in question, so that wouldn't have any effect on the query's > performance. Sorry, I did not notice that the column "unit_id" existed in both "unit" and "field" tables. / Eskil -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance
Re: [PERFORM] Slow query with 3 table joins
Hi Dave - thank you very much for all this advice! I will try each of these and post back results (some of this stuff, like creating the index, which is happening now, takes a very long time). Thanks again for all these pointers. Cheers, Alessandro On Wed, Apr 26, 2017 at 12:12 AM, David Rowleywrote: > On 26 April 2017 at 15:19, Alessandro Ferrucci > wrote: > > After about 40 inutes the slow query finally finished and the result of > the > > EXPLAIN plan can be found here: > > > > https://explain.depesz.com/s/BX22 > > > Index Scan using field_unit_id_idx on field (cost=0.00..8746678.52 > rows=850149 width=8) (actual time=0.030..2414345.998 rows=10315 loops=1)" > > This estimate seems a long way off. Are the stats up-to-date on the > table? Try again after running: ANALYZE field; > > It might also be a good idea to ANALYZE all the tables. Is auto-vacuum > switched on? > > The plan in question would work better if you create an index on field > (field_name, unit_id); > > but I think if you update the stats the plan will switch. > > A HashJoin, hashing "unit" and index scanning on field_field_name_idx > would have been a much smarter plan choice for the planner to make. > > Also how many distinct field_names are there? SELECT COUNT(DISTINCT > field_name) FROM field; > > You may want to increase the histogram buckets on that columns if > there are more than 100 field names, and the number of rows with each > field name is highly variable. ALTER TABLE field ALTER COLUMN > field_name SET STATISTICS ; 100 is the default, and 1 > is the maximum. > > > -- > David Rowley http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Training & Services > -- Signed, Alessandro Ferrucci
Re: [PERFORM] Slow query with 3 table joins
Hi Eskil - The I believe the id-field you're referring to is the UNIT.UNIT_ID, I could change this to a varchar, however that column is not used in the query in question, so that wouldn't have any effect on the query's performance. Just for curiosity - I have changed the ANSWER.ANS datatype to a varchar(250), but that did not affect the performance of the query. On Wed, Apr 26, 2017 at 2:24 AM, Johan Fredrikssonwrote: > tis 2017-04-25 klockan 23:19 -0400 skrev Alessandro Ferrucci: > > After about 40 inutes the slow query finally finished and the result > > of the EXPLAIN plan can be found here: > > > > > > https://explain.depesz.com/s/BX22 > > > > > > Thanks, > > Alessandro Ferrucci > > I'm not so familiar with the index implementetion in Postgres, but I > don't think it is very efficient to index a text-field. It also loooks a > bit strange that a id-field has the datatype "text" rather than integer > or varchar. > > / Eskil > > > > > > -- > Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-performance > -- Signed, Alessandro Ferrucci
Re: [PERFORM] Slow query with 3 table joins
tis 2017-04-25 klockan 23:19 -0400 skrev Alessandro Ferrucci: > After about 40 inutes the slow query finally finished and the result > of the EXPLAIN plan can be found here: > > > https://explain.depesz.com/s/BX22 > > > Thanks, > Alessandro Ferrucci I'm not so familiar with the index implementetion in Postgres, but I don't think it is very efficient to index a text-field. It also loooks a bit strange that a id-field has the datatype "text" rather than integer or varchar. / Eskil -- Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance