Re: [PERFORM] Slow query with 3 table joins

2017-04-26 Thread Matthew Bellew
This looks like the same optimizer problem that occasionally plagues our
customers.  Whenever the estimated rows of a join==1, but the actual rows
is higher, the optimizer may choose very poor plans.  I made some attempts
to fix.  The very simple fix is to never estimate 1 for a join result.
Even using 2 works remarkably well as a defense against this problem.


https://github.com/labkey-matthewb/postgres/commit/b1fd99f4deffbbf3db2172ccaba51a34f18d1b1a

I also made a much more correct but complicated patch to track both
uniqueness and selectivity thought the optimizer, but I didn't quite push
that over the finish line (I made a mistake in the hash join code, and got
distracted by my day job before finishing it).

  https://github.com/labkey-matthewb/postgres/commits/struct_selectivity

The second path is certainly better approach, but needs someone to pick up
the mission.

Matt

On Wed, Apr 26, 2017 at 8:00 AM, Gerardo Herzig <gher...@fmed.uba.ar> wrote:

> Some other approaches you could try:
>
> 1) What about an hashed index? You could make
> CREATE INDEX ON FIELD (unit_id, hashtext(field_name))
>
> and changing your query accordingly
>
> "where hashtext(FIELD.FIELD_NAME)=hashtext('SHEETS_PRESENT') "
>
> 2) Partitioning (not native yet, but can be simulated through
> inheritance), like in
> https://www.postgresql.org/docs/current/static/ddl-partitioning.html
> This could work well if you have a sort of limited different values in
> FIELD.FIELD_NAME
>
> Gerardo
>
> - Mensaje original -
> > De: "Alessandro Ferrucci" <alessandroferru...@gmail.com>
> > Para: pgsql-performance@postgresql.org
> > Enviados: Miércoles, 26 de Abril 2017 0:19:37
> > Asunto: Re: [PERFORM] Slow query with 3 table joins
> >
> >
> >
> > After about 40 inutes the slow query finally finished and the result
> > of the EXPLAIN plan can be found here:
> >
> >
> > https://explain.depesz.com/s/BX22
> >
> >
> > Thanks,
> > Alessandro Ferrucci
> >
> >
> > On Tue, Apr 25, 2017 at 11:10 PM, Alessandro Ferrucci <
> > alessandroferru...@gmail.com > wrote:
> >
> >
> >
> >
> > Hello - I am migrating a current system to PostgreSQL and I am having
> > an issue with a relatively straightforward query being extremely
> > slow.
> >
> >
> > The following are the definitions of the tables:
> >
> >
> > CREATE TABLE popt_2017.unit
> > (
> > id serial NOT NULL,
> > unit_id text,
> > batch_id text,
> > create_date timestamp without time zone DEFAULT now(),
> > update_date timestamp without time zone,
> > CONSTRAINT unit_pkey PRIMARY KEY (id)
> > )
> > WITH (
> > OIDS=FALSE
> > );
> >
> >
> > CREATE TABLE popt_2017.field
> > (
> > id serial NOT NULL,
> > unit_id integer,
> > subunit_data_id integer,
> > field_name character varying(50),
> > page_id character varying(20),
> > page_type character varying(20),
> > batch_id character varying(20),
> > file_name character varying(20),
> > data_concept integer,
> > "GROUP" integer,
> > omr_group integer,
> > pres integer,
> > reg_data text,
> > ocr_conf text,
> > ocr_dict text,
> > ocr_phon text,
> > create_date timestamp without time zone DEFAULT now(),
> > update_date timestamp without time zone,
> > CONSTRAINT field_pkey PRIMARY KEY (id),
> > CONSTRAINT field_subunit_data_id_fkey FOREIGN KEY (subunit_data_id)
> > REFERENCES popt_2017.subunit (id) MATCH SIMPLE
> > ON UPDATE NO ACTION ON DELETE NO ACTION,
> > CONSTRAINT field_unit_id_fk FOREIGN KEY (unit_id)
> > REFERENCES popt_2017.unit (id) MATCH FULL
> > ON UPDATE NO ACTION ON DELETE NO ACTION,
> > CONSTRAINT field_unit_id_fkey FOREIGN KEY (unit_id)
> > REFERENCES popt_2017.unit (id) MATCH SIMPLE
> > ON UPDATE NO ACTION ON DELETE NO ACTION
> > )
> > WITH (
> > OIDS=FALSE
> > );
> >
> >
> > CREATE TABLE popt_2017.answer
> > (
> > id serial NOT NULL,
> > field_id integer,
> > ans_status integer,
> > ans text,
> > luggage text,
> > arec text,
> > kfi_partition integer,
> > final boolean,
> > length integer,
> > create_date timestamp without time zone DEFAULT now(),
> > update_date timestamp without time zone,
> > CONSTRAINT answer_pkey PRIMARY KEY (id),
> > CONSTRAINT answer_field_id_fk FOREIGN KEY (field_id)
> > REFERENCES popt_2017.field (id) MATCH FULL
> > ON UPDATE NO ACTION ON DELETE NO ACTION,
> > CONSTRAINT answer_field_id_fkey FOREIGN KEY (field_id)

Re: [PERFORM] Slow query with 3 table joins

2017-04-26 Thread Alessandro Ferrucci
Dave -

I had re-ran ANALYZE and VACUUM on all the tables and that did not seem to
have fixed the issue (the query still took a long time, however I did not
let it finish to produce a full EXPLAIN plan.

However - after creating an index on FIELD(FIELD_NAME,UNIT_ID) and now the
query runs very fast ( I changed the FIELD_NAME clause so I would not run
into
any caching ).

The new query is (notice the new FIELD_NAME value to avoid caching).

EXPLAIN (ANALYZE,BUFFERS) SELECT
   UNIT.ID AS UNIT_ID,
   UNIT.UNIT_ID AS UNIT_UNIT_ID,
   UNIT.BATCH_ID AS UNIT_BATCH_ID,
   UNIT.CREATE_DATE AS UNIT_CREATE_DATE,
   UNIT.UPDATE_DATE AS UNIT_UPDATE_DATE
FROM
   UNIT, FIELD, ANSWER
WHERE
   UNIT.ID=FIELD.UNIT_ID AND
   FIELD.ID=ANSWER.FIELD_ID AND
   FIELD.FIELD_NAME='RESP_PH_PREFIX_ID' AND
   ANSWER.ANS='2';

You can find the EXPLAIN plan here:

https://explain.depesz.com/s/apYR

I believe this fixes the issue as far as I can see.  I'm going to play
around with it more and see how it goes.

I wanted to also answer your question as to how many different values there
are for the FIELD_NAME (and how many rows of each value there are)

here is it:

SELECT FIELD_NAME,COUNT(*) FROM FIELD GROUP BY FIELD_NAME ORDER BY COUNT;

"PAGE_SERIAL";10315
"SHEETS_PRESENT";10315
"RESP_PH_AREA_ID";10556
"RESP_PH_PREFIX_ID";10559
"RESP_PH_SUFFIX_ID";10560
"H_LOC_ADD_NO_IND";10587
"H_TENURE_RENTED_IND";11162
"H_TENURE_OWNED_MORT_IND";11199
"H_TENURE_OWNED_FREE_IND";11208
"PAPER_JIC_2_TEXT";11413
"PAPER_JIC_1_TEXT";11413
"H_TENURE_OCC_NOPAY_IND";11478
"H_LOC_CHILDREN_IND";11496
"H_LOC_RELATIVES_IND";11496
"H_LOC_TEMPORARY_IND";11500
"H_LOC_NONRELATIVES_IND";11510
"PSEUDO_FIELD_MARGINALIA";87744
"H_SIZE_STATED_INT";207918
"P_REL_NO_IND";825240
"P_REL_YES_IND";825240
"P_REL_OTHER_NONREL_IND";1239894
"P_REL_CHILD_ADOPTED_IND";1239894
"P_REL_CHILD_BIO_IND";1239894
"P_REL_CHILD_FOSTER_IND";1239894
"P_REL_CHILD_STEP_IND";1239894
"P_REL_GRANDCHILD_IND";1239894
"P_REL_HOUSEROOMMATE_IND";1239894
"P_REL_INLAW_CHILD_IND";1239894
"P_REL_INLAW_PARENT_IND";1239894
"P_REL_OTHER_REL_IND";1239894
"P_REL_PARENT_IND";1239894
"P_REL_PARTNER_OPP_IND";1239894
"P_REL_PARTNER_SAME_IND";1239894
"P_REL_SIBLING_IND";1239894
"P_REL_SPOUSE_OPP_IND";1239894
"P_REL_SPOUSE_SAME_IND";1239894
"P_TRBSHR_CORP_NAME";1446204
"P_TRBSHR_YES_IND";1446204
"P_TRBSHR_NO_IND";1446204
"P_LOC_ELSE_COLLEGE_IND";1446204
"P_LOC_ELSE_JAIL_IND";1446204
"P_LOC_ELSE_JOB_IND";1446204
"P_LOC_ELSE_MILITARY_IND";1446204
"P_LOC_ELSE_NO_IND";1446204
"P_TRBENR_YES_IND";1446204
"P_LOC_ELSE_SEASONAL_IND";1446204
"P_LOC_ELSE_NURSINGHOME_IND";1446204
"P_TRBENR_TRIBE_NAME";1446204
"P_TRBENR_NO_IND";1446204
"P_LOC_ELSE_RELATIVES_IND";1446204
"P_LOC_ELSE_OTHER_IND";1446204
"P_RACE_WHITE_IND";1447812
"P_RACE2_TONGAN_IND";1447812
"P_RACE2_AFAM_IND";1447812
"P_RACE2_AIAN_TEXT";1447812
"P_RACE2_ASIANINDIAN_IND";1447812
"P_RACE2_ASIAN_TEXT";1447812
"P_RACE2_BLACK_TEXT";1447812
"P_RACE2_CHAMORRO_IND";1447812
"P_RACE2_CHINESE_IND";1447812
"P_RACE2_COLOMBIAN_IND";1447812
"P_RACE2_CUBAN_IND";1447812
"P_RACE2_DOMINICAN_IND";1447812
"P_RACE2_EGYPTIAN_IND";1447812
"P_RACE2_ENGLISH_IND";1447812
"P_RACE2_ETHIOPIAN_IND";1447812
"P_RACE2_FIJIAN_IND";1447812
"P_RACE2_FILIPINO_IND";1447812
"P_RACE2_FRENCH_IND";1447812
"P_RACE2_GERMAN_IND";1447812
"P_RACE2_HAITIAN_IND";1447812
"P_RACE2_HISP_TEXT";1447812
"P_RACE2_IRANIAN_IND";1447812
"P_RACE2_IRISH_IND";1447812
"P_RACE2_ISRAELI_IND";1447812
"P_RACE2_ITALIAN_IND";1447812
"P_RACE2_JAMAICAN_IND";1447812
"P_RACE2_JAPANESE_IND";1447812
"P_RACE2_KOREAN_IND";1447812
"P_RACE2_LEBANESE_IND";1447812
"P_RACE2_MARSHALLESE_IND";1447812
"P_RACE2_MENA_TEXT";1447812
"P_RACE2_MEXICAN_IND";1447812
"P_RACE2_MOROCCAN_IND";1447812
"P_RACE2_NATHAWAIIAN_IND";1447812
"P_RACE2_NHPI_TEXT";1447812
"P_RACE2_NIGERIAN_IND";1447812
"P_RACE2_POLISH_IND";1447812
"P_RACE2_PUERTORICAN_IND";1447812
"P_RACE2_SALVADORAN_IND";1447812
"P_RACE2_SAMOAN_IND";1447812
"P_RACE2_SOMALI_IND";1447812
"P_RACE2_SOR_TEXT";1447812
"P_RACE2_SYRIAN_IND";1447812
"P_RACE2_VIETNAMESE_IND";1447812
"P_RACE2_WHITE_TEXT";1447812
"P_RACE_AIAN_IND";1447812
"P_RACE_ASIAN_IND";1447812
"P_RACE_BLACK_IND";1447812
"P_RACE_HISP_IND";1447812
"P_RACE_MENA_IND";1447812
"P_RACE_NHPI_IND";1447812
"P_RACE_SOR_IND";1447812
"P_SEX_MALE_IND";2273052
"P_SEX_FEMALE_IND";2273052
"P_MIDDLE_NAME";2273052
"P_LAST_NAME";2273052
"P_FIRST_NAME";2273052
"P_BIRTH_YEAR_INT";2273052
"P_BIRTH_MONTH_INT";2273052
"P_BIRTH_DAY_INT";2273052
"P_AGE_INT";2273052


I want to give a HUGE thanks to everyone who took the time to look at my
issue and provide insight and assistance, you folks are truly awesome!



On Wed, Apr 26, 2017 at 12:12 AM, David Rowley  wrote:

> On 26 April 2017 at 15:19, Alessandro Ferrucci
>  wrote:
> > After about 40 inutes the slow query finally finished and the result of
> the
> > EXPLAIN plan can be found here:
> >
> > https://explain.depesz.com/s/BX22
>
> > Index Scan using 

Re: [PERFORM] Slow query with 3 table joins

2017-04-26 Thread Gerardo Herzig
Some other approaches you could try:

1) What about an hashed index? You could make 
CREATE INDEX ON FIELD (unit_id, hashtext(field_name))

and changing your query accordingly

"where hashtext(FIELD.FIELD_NAME)=hashtext('SHEETS_PRESENT') "

2) Partitioning (not native yet, but can be simulated through inheritance), 
like in
https://www.postgresql.org/docs/current/static/ddl-partitioning.html
This could work well if you have a sort of limited different values in 
FIELD.FIELD_NAME

Gerardo

- Mensaje original -
> De: "Alessandro Ferrucci" <alessandroferru...@gmail.com>
> Para: pgsql-performance@postgresql.org
> Enviados: Miércoles, 26 de Abril 2017 0:19:37
> Asunto: Re: [PERFORM] Slow query with 3 table joins
> 
> 
> 
> After about 40 inutes the slow query finally finished and the result
> of the EXPLAIN plan can be found here:
> 
> 
> https://explain.depesz.com/s/BX22
> 
> 
> Thanks,
> Alessandro Ferrucci
> 
> 
> On Tue, Apr 25, 2017 at 11:10 PM, Alessandro Ferrucci <
> alessandroferru...@gmail.com > wrote:
> 
> 
> 
> 
> Hello - I am migrating a current system to PostgreSQL and I am having
> an issue with a relatively straightforward query being extremely
> slow.
> 
> 
> The following are the definitions of the tables:
> 
> 
> CREATE TABLE popt_2017.unit
> (
> id serial NOT NULL,
> unit_id text,
> batch_id text,
> create_date timestamp without time zone DEFAULT now(),
> update_date timestamp without time zone,
> CONSTRAINT unit_pkey PRIMARY KEY (id)
> )
> WITH (
> OIDS=FALSE
> );
> 
> 
> CREATE TABLE popt_2017.field
> (
> id serial NOT NULL,
> unit_id integer,
> subunit_data_id integer,
> field_name character varying(50),
> page_id character varying(20),
> page_type character varying(20),
> batch_id character varying(20),
> file_name character varying(20),
> data_concept integer,
> "GROUP" integer,
> omr_group integer,
> pres integer,
> reg_data text,
> ocr_conf text,
> ocr_dict text,
> ocr_phon text,
> create_date timestamp without time zone DEFAULT now(),
> update_date timestamp without time zone,
> CONSTRAINT field_pkey PRIMARY KEY (id),
> CONSTRAINT field_subunit_data_id_fkey FOREIGN KEY (subunit_data_id)
> REFERENCES popt_2017.subunit (id) MATCH SIMPLE
> ON UPDATE NO ACTION ON DELETE NO ACTION,
> CONSTRAINT field_unit_id_fk FOREIGN KEY (unit_id)
> REFERENCES popt_2017.unit (id) MATCH FULL
> ON UPDATE NO ACTION ON DELETE NO ACTION,
> CONSTRAINT field_unit_id_fkey FOREIGN KEY (unit_id)
> REFERENCES popt_2017.unit (id) MATCH SIMPLE
> ON UPDATE NO ACTION ON DELETE NO ACTION
> )
> WITH (
> OIDS=FALSE
> );
> 
> 
> CREATE TABLE popt_2017.answer
> (
> id serial NOT NULL,
> field_id integer,
> ans_status integer,
> ans text,
> luggage text,
> arec text,
> kfi_partition integer,
> final boolean,
> length integer,
> create_date timestamp without time zone DEFAULT now(),
> update_date timestamp without time zone,
> CONSTRAINT answer_pkey PRIMARY KEY (id),
> CONSTRAINT answer_field_id_fk FOREIGN KEY (field_id)
> REFERENCES popt_2017.field (id) MATCH FULL
> ON UPDATE NO ACTION ON DELETE NO ACTION,
> CONSTRAINT answer_field_id_fkey FOREIGN KEY (field_id)
> REFERENCES popt_2017.field (id) MATCH SIMPLE
> ON UPDATE NO ACTION ON DELETE NO ACTION
> )
> WITH (
> OIDS=FALSE
> );
> 
> 
> Below are the index definitions for those tables:
> 
> 
> UNIT:
> CREATE UNIQUE INDEX unit_pkey ON unit USING btree (id);
> CREATE INDEX unit_unit_id_idx ON unit USING btree (unit_id);
> 
> 
> FIELD:
> CREATE UNIQUE INDEX field_pkey ON field USING btree (id)
> CREATE INDEX field_unit_id_idx ON field USING btree (unit_id)
> CREATE INDEX field_subunit_id_idx ON field USING btree
> (subunit_data_id)
> CREATE INDEX field_field_name_idx ON field USING btree (field_name)
> 
> 
> ANSWER:
> CREATE UNIQUE INDEX answer_pkey ON answer USING btree (id)
> CREATE INDEX answer_field_id_idx ON answer USING btree (field_id)
> CREATE INDEX answer_ans_idx ON answer USING btree (ans)
> 
> 
> The tables each have the following number of rows:
> 
> 
> UNIT: 10,315
> FIELD: 139,397,965
> ANSWER: 3,463,300
> 
> 
> The query in question is:
> 
> 
> SELECT
> UNIT.ID AS UNIT_ID,
> UNIT.UNIT_ID AS UNIT_UNIT_ID,
> UNIT.BATCH_ID AS UNIT_BATCH_ID,
> UNIT.CREATE_DATE AS UNIT_CREATE_DATE,
> UNIT.UPDATE_DATE AS UNIT_UPDATE_DATE
> FROM
> UNIT, FIELD, ANSWER
> WHERE
> UNIT.ID =FIELD.UNIT_ID AND
> FIELD.ID =ANSWER.FIELD_ID AND
> FIELD.FIELD_NAME='SHEETS_PRESENT' AND
> ANSWER.ANS='2';
> 
> 
> I attempted to run an EXPLAIN (ANALYZE,BUFFERS) and the 

Re: [PERFORM] Slow query with 3 table joins

2017-04-26 Thread Gerardo Herzig


- Mensaje original -
> De: "Alessandro Ferrucci" <alessandroferru...@gmail.com>
> Para: pgsql-performance@postgresql.org
> Enviados: Miércoles, 26 de Abril 2017 0:19:37
> Asunto: Re: [PERFORM] Slow query with 3 table joins
> 
> 
> 
> After about 40 inutes the slow query finally finished and the result
> of the EXPLAIN plan can be found here:
> 
> 
> https://explain.depesz.com/s/BX22
> 
> 
> Thanks,
> Alessandro Ferrucci

1) Looking at the "Rows removed by filter" in that explain, looks like a 
selectivity issue: Many (many many) rows are fetched, just to be rejected 
later. 
I think you can try a partial index on ''field (unit_id) where 
field_name="SHEETS_PRESENT"'', if it is practical to you.
See https://www.postgresql.org/docs/current/static/indexes-partial.html for a 
good read about partial indexes.

2) 9.2 is a pretty old version of PG. If you are migrating yet, you should 
consider a more recent version

HTH

Gerardo


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow query with 3 table joins

2017-04-26 Thread Johan Fredriksson

> Hi Eskil - 
> 
> 
> The I believe the id-field you're referring to is the UNIT.UNIT_ID, I
> could change this to a varchar, however that column is not used in the
> query in question, so that wouldn't have any effect on the query's
> performance.

Sorry, I did not notice that the column "unit_id" existed in both "unit"
and "field" tables.

/ Eskil





-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow query with 3 table joins

2017-04-26 Thread Alessandro Ferrucci
Hi Dave -

thank you very much for all this advice!  I will try each of these and post
back results (some of this stuff, like creating the index, which is
happening now, takes a very long time).

Thanks again for all these pointers.

Cheers,
Alessandro

On Wed, Apr 26, 2017 at 12:12 AM, David Rowley  wrote:

> On 26 April 2017 at 15:19, Alessandro Ferrucci
>  wrote:
> > After about 40 inutes the slow query finally finished and the result of
> the
> > EXPLAIN plan can be found here:
> >
> > https://explain.depesz.com/s/BX22
>
> > Index Scan using field_unit_id_idx on field  (cost=0.00..8746678.52
> rows=850149 width=8) (actual time=0.030..2414345.998 rows=10315 loops=1)"
>
> This estimate seems a long way off. Are the stats up-to-date on the
> table? Try again after running: ANALYZE field;
>
> It might also be a good idea to ANALYZE all the tables. Is auto-vacuum
> switched on?
>
> The plan in question would work better if you create an index on field
> (field_name, unit_id);
>
> but I think if you update the stats the plan will switch.
>
> A HashJoin, hashing "unit"  and index scanning on field_field_name_idx
> would have been a much smarter plan choice for the planner to make.
>
> Also how many distinct field_names are there? SELECT COUNT(DISTINCT
> field_name) FROM field;
>
> You may want to increase the histogram buckets on that columns if
> there are more than 100 field names, and the number of rows with each
> field name is highly variable. ALTER TABLE field ALTER COLUMN
> field_name SET STATISTICS ; 100 is the default, and 1
> is the maximum.
>
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>



-- 
Signed,
Alessandro Ferrucci


Re: [PERFORM] Slow query with 3 table joins

2017-04-26 Thread Alessandro Ferrucci
Hi Eskil -

The I believe the id-field you're referring to is the UNIT.UNIT_ID, I could
change this to a varchar, however that column is not used in the query in
question, so that wouldn't have any effect on the query's performance.

Just for curiosity - I have changed the ANSWER.ANS datatype to a
varchar(250), but that did not affect the performance of the query.



On Wed, Apr 26, 2017 at 2:24 AM, Johan Fredriksson  wrote:

> tis 2017-04-25 klockan 23:19 -0400 skrev Alessandro Ferrucci:
> > After about 40 inutes the slow query finally finished and the result
> > of the EXPLAIN plan can be found here:
> >
> >
> > https://explain.depesz.com/s/BX22
> >
> >
> > Thanks,
> > Alessandro Ferrucci
>
> I'm not so familiar with the index implementetion in Postgres, but I
> don't think it is very efficient to index a text-field. It also loooks a
> bit strange that a id-field has the datatype "text" rather than integer
> or varchar.
>
> / Eskil
>
>
>
>
>
> --
> Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>



-- 
Signed,
Alessandro Ferrucci


Re: [PERFORM] Slow query with 3 table joins

2017-04-26 Thread Johan Fredriksson
tis 2017-04-25 klockan 23:19 -0400 skrev Alessandro Ferrucci:
> After about 40 inutes the slow query finally finished and the result
> of the EXPLAIN plan can be found here:
> 
> 
> https://explain.depesz.com/s/BX22
> 
> 
> Thanks,
> Alessandro Ferrucci

I'm not so familiar with the index implementetion in Postgres, but I
don't think it is very efficient to index a text-field. It also loooks a
bit strange that a id-field has the datatype "text" rather than integer
or varchar.

/ Eskil





-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


Re: [PERFORM] Slow query with 3 table joins

2017-04-25 Thread David Rowley
On 26 April 2017 at 15:19, Alessandro Ferrucci
 wrote:
> After about 40 inutes the slow query finally finished and the result of the
> EXPLAIN plan can be found here:
>
> https://explain.depesz.com/s/BX22

> Index Scan using field_unit_id_idx on field  (cost=0.00..8746678.52 
> rows=850149 width=8) (actual time=0.030..2414345.998 rows=10315 loops=1)"

This estimate seems a long way off. Are the stats up-to-date on the
table? Try again after running: ANALYZE field;

It might also be a good idea to ANALYZE all the tables. Is auto-vacuum
switched on?

The plan in question would work better if you create an index on field
(field_name, unit_id);

but I think if you update the stats the plan will switch.

A HashJoin, hashing "unit"  and index scanning on field_field_name_idx
would have been a much smarter plan choice for the planner to make.

Also how many distinct field_names are there? SELECT COUNT(DISTINCT
field_name) FROM field;

You may want to increase the histogram buckets on that columns if
there are more than 100 field names, and the number of rows with each
field name is highly variable. ALTER TABLE field ALTER COLUMN
field_name SET STATISTICS ; 100 is the default, and 1
is the maximum.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


-- 
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[PERFORM] Slow query with 3 table joins

2017-04-25 Thread Alessandro Ferrucci
Hello - I am migrating a current system to PostgreSQL and I am having an
issue with a relatively straightforward query being extremely slow.

The following are the definitions of the tables:

CREATE TABLE popt_2017.unit
(
  id serial NOT NULL,
  unit_id text,
  batch_id text,
  create_date timestamp without time zone DEFAULT now(),
  update_date timestamp without time zone,
  CONSTRAINT unit_pkey PRIMARY KEY (id)
)
WITH (
  OIDS=FALSE
);

CREATE TABLE popt_2017.field
(
  id serial NOT NULL,
  unit_id integer,
  subunit_data_id integer,
  field_name character varying(50),
  page_id character varying(20),
  page_type character varying(20),
  batch_id character varying(20),
  file_name character varying(20),
  data_concept integer,
  "GROUP" integer,
  omr_group integer,
  pres integer,
  reg_data text,
  ocr_conf text,
  ocr_dict text,
  ocr_phon text,
  create_date timestamp without time zone DEFAULT now(),
  update_date timestamp without time zone,
  CONSTRAINT field_pkey PRIMARY KEY (id),
  CONSTRAINT field_subunit_data_id_fkey FOREIGN KEY (subunit_data_id)
  REFERENCES popt_2017.subunit (id) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT field_unit_id_fk FOREIGN KEY (unit_id)
  REFERENCES popt_2017.unit (id) MATCH FULL
  ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT field_unit_id_fkey FOREIGN KEY (unit_id)
  REFERENCES popt_2017.unit (id) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
  OIDS=FALSE
);

CREATE TABLE popt_2017.answer
(
  id serial NOT NULL,
  field_id integer,
  ans_status integer,
  ans text,
  luggage text,
  arec text,
  kfi_partition integer,
  final boolean,
  length integer,
  create_date timestamp without time zone DEFAULT now(),
  update_date timestamp without time zone,
  CONSTRAINT answer_pkey PRIMARY KEY (id),
  CONSTRAINT answer_field_id_fk FOREIGN KEY (field_id)
  REFERENCES popt_2017.field (id) MATCH FULL
  ON UPDATE NO ACTION ON DELETE NO ACTION,
  CONSTRAINT answer_field_id_fkey FOREIGN KEY (field_id)
  REFERENCES popt_2017.field (id) MATCH SIMPLE
  ON UPDATE NO ACTION ON DELETE NO ACTION
)
WITH (
  OIDS=FALSE
);

Below are the index definitions for those tables:

UNIT:
CREATE UNIQUE INDEX unit_pkey ON unit USING btree (id);
CREATE INDEX unit_unit_id_idx ON unit USING btree (unit_id);

FIELD:
CREATE UNIQUE INDEX field_pkey ON field USING btree (id)
CREATE INDEX field_unit_id_idx ON field USING btree (unit_id)
CREATE INDEX field_subunit_id_idx ON field USING btree (subunit_data_id)
CREATE INDEX field_field_name_idx ON field USING btree (field_name)

ANSWER:
CREATE UNIQUE INDEX answer_pkey ON answer USING btree (id)
CREATE INDEX answer_field_id_idx ON answer USING btree (field_id)
CREATE INDEX answer_ans_idx ON answer USING btree (ans)

The tables each have the following number of rows:

UNIT: 10,315
FIELD: 139,397,965
ANSWER: 3,463,300

The query in question is:

SELECT
   UNIT.ID AS UNIT_ID,
   UNIT.UNIT_ID AS UNIT_UNIT_ID,
   UNIT.BATCH_ID AS UNIT_BATCH_ID,
   UNIT.CREATE_DATE AS UNIT_CREATE_DATE,
   UNIT.UPDATE_DATE AS UNIT_UPDATE_DATE
FROM
   UNIT, FIELD, ANSWER
WHERE
   UNIT.ID=FIELD.UNIT_ID AND
   FIELD.ID=ANSWER.FIELD_ID AND
   FIELD.FIELD_NAME='SHEETS_PRESENT' AND
   ANSWER.ANS='2';

I attempted to run an EXPLAIN (ANALYZE,BUFFERS) and the query has been
running for 32 minutes now, So I won't be able to post the results (as I've
never been able to get the query to actually finish.

But, if I remove the join to UNIT (and just join FIELD and ANSWER) the
resulting query is sufficiently fast, (the first time it ran in roughly 3
seconds), the query as such is:

SELECT * FROM
   ANSWER, FIELD
WHERE
   FIELD.ID=ANSWER.FIELD_ID AND
   FIELD.FIELD_NAME='SHEETS_PRESENT' AND
   ANSWER.ANS='2';

The EXPLAIN ( ANALYZE, BUFFERS ) output of that query can be found here
https://explain.depesz.com/s/ueJq

These tables are static for now, so they do not get DELETEs or INSERTS at
all and I have run VACUUM ANALYZE on all the affected tables.

I'm running PostgreSQL PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu,
compiled by gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-52), 64-bit

I'm running this on RHEL 6.9

On a server with 32 GB of ram, 2 CPUs.

The following are the changes to postgresql.conf that I have made:

shared_buffers = 7871MB
effective_cache_size = 23611MB
work_mem = 1000MB
maintenance_work_mem = 2048MB

I have not changed the autovacuum settings, but since the tables are static
for now and I've already ran VACUUM that should not have any effect.

Any assistance that could be provided is greatly appreciated.

Thank you,
Alessandro Ferrucci