Am Montag, 30. Januar 2006 17:42 schrieb Tom Lane: > Philipp Reisner <[EMAIL PROTECTED]> writes: > > Version: 8.0.2 > > I don't think so ... neither bitmap scans nor slot_getattr exist in 8.0. >
Sorry, a typo, 8.1.2 of course. > > Is this bug-report of any use so far ? > > Not a lot. You need to find out what the difference is between the > cases where the query runs quickly and those where it doesn't. I'm > betting that the planner is choosing a very bad plan in the slow cases, > but there's not a lot of evidence here to show what or why. > > The explain analyze output shows two levels of hash join underneath > four levels of nestloop join, whereas the stack trace looks like there > are five levels of nestloop and only one hash. So this is some evidence > that a different plan is being used in the slow cases. The stack trace > doesn't provide nearly enough info about what that plan is, though. > What reasons are there for the planer to choose a different plan ? > Do you have geqo_threshold set to less than its default value? Do you > sometimes execute the query with different sets of parameters? Either > of these might lead to changes of plan. Since we had very bad experience with the generic query optimizer we completely disabled it. The non geqo plans perform multitudes better on our data-set and work-load. -- And the time to plan the query is always negligible compared to the execution time. We have the endless loop about 10 times a day (but execute about 1.5 million queries a day) : Program received signal SIGINT, Interrupt. 0x00002aaaaba687b4 in wcsrtombs () from /lib/libc.so.6 (gdb) bt #0 0x00002aaaaba687b4 in wcsrtombs () from /lib/libc.so.6 #1 0x00002aaaaba163c8 in wcstombs () from /lib/libc.so.6 #2 0x000000000058dcc4 in text_oid () #3 0x000000000058deb9 in upper () #4 0x00000000004e216e in ExecMakeFunctionResult () #5 0x00000000004e210d in ExecMakeFunctionResult () #6 0x00000000004e5063 in ExecQual () #7 0x00000000004e554e in ExecScan () #8 0x00000000004e0cc8 in ExecProcNode () #9 0x00000000004ebeb6 in ExecEndHashJoin () #10 0x00000000004eb7fa in ExecHashJoin () #11 0x00000000004e0d50 in ExecProcNode () #12 0x00000000004ee29d in ExecNestLoop () #13 0x00000000004e0d36 in ExecProcNode () #14 0x00000000004ee261 in ExecNestLoop () <<----------- #15 0x00000000004e0d36 in ExecProcNode () #16 0x00000000004ee261 in ExecNestLoop () #17 0x00000000004e0d36 in ExecProcNode () #18 0x00000000004ee261 in ExecNestLoop () #19 0x00000000004e0d36 in ExecProcNode () #20 0x00000000004ee261 in ExecNestLoop () #21 0x00000000004e0d36 in ExecProcNode () #22 0x00000000004ef57c in ExecSort () #23 0x00000000004e0d6a in ExecProcNode () #24 0x00000000004df73f in ExecEndPlan () #25 0x00000000004dec09 in ExecutorRun () #26 0x000000000055f87a in PortalRun () #27 0x000000000055f5f9 in PortalRun () #28 0x000000000055b561 in pg_plan_queries () #29 0x000000000055df8c in PostgresMain () #30 0x0000000000539b01 in ClosePostmasterPorts () #31 0x00000000005395d0 in ClosePostmasterPorts () #32 0x0000000000537b4d in PostmasterMain () #33 0x0000000000536e59 in PostmasterMain () #34 0x00000000004fda0c in main () After detaching with gdb, I killed it. I configured postgres to dump the statement to the logfile with "log_min_error_statement = error". So I have the exact query that caused the problem. This time the plainer says, only explain: Sort (cost=16257.75..16257.76 rows=1 width=262) Sort Key: con.shortname -> Nested Loop (cost=1.18..16257.74 rows=1 width=262) -> Nested Loop (cost=1.18..16254.57 rows=1 width=130) -> Nested Loop (cost=1.18..16251.50 rows=1 width=125) -> Nested Loop (cost=1.18..16248.45 rows=1 width=111) -> Nested Loop (cost=1.18..16245.31 rows=1 width=86) Join Filter: ("outer".cel_objid = "inner".objid) -> Seq Scan on devices dev (cost=0.00..11984.25 rows=1 width=63) Filter: ((upper((COALESCE(isactiv, 'Y'::character varying))::text) = 'Y'::text) AND (upper((COALESCE(iscommittedsp, 'Y'::character varying))::text) = 'Y'::text) AND (upper((COALESCE(iscommittedcust, 'Y'::character varying))::text) = 'Y'::text)) -> Hash Left Join (cost=1.18..4256.22 rows=387 width=27) Hash Cond: ("outer".ser_objid = "inner".objid) -> Seq Scan on contractelements cel (cost=0.00..4253.10 rows=387 width=21) Filter: (upper((isactiv)::text) = 'Y'::text) -> Hash (cost=1.14..1.14 rows=14 width=14) -> Seq Scan on servicetypes ser (cost=0.00..1.14 rows=14 width=14) -> Index Scan using contracts_pkey on contracts con (cost=0.00..3.13 rows=1 width=33) Index Cond: ("outer".con_objid = con.objid) Filter: ((ccu_objid = 3837149) OR (ccu_objid = 3837320) OR (ccu_objid = 3837375) OR (ccu_objid = 3974872) OR (ccu_objid = 5032031) OR (ccu_objid = 7220313) OR (ccu_objid = 15684899) OR (ccu_objid = 16052168) OR (ccu_objid = 19196573) OR (ccu_objid = 20095712) OR (ccu_objid = 22399813) OR (ccu_objid = 42092332) OR (ccu_objid = 42092454) OR (ccu_objid = 42092574) OR (ccu_objid = 42886947) OR (ccu_objid = 43813234)) -> Index Scan using devicetypes_pkey on devicetypes dty (cost=0.00..3.04 rows=1 width=18) Index Cond: (dty.objid = "outer".dty_objid) -> Index Scan using bpartners_pkey on bpartners man (cost=0.00..3.06 rows=1 width=13) Index Cond: (man.objid = "outer".bpa_objid) -> Index Scan using locations_pkey on locations loc (cost=0.00..3.12 rows=1 width=136) Index Cond: (loc.objid = "outer".loc_objid) Filter: (upper((shortname)::text) = '5195'::text) (26 rows) Explain and execute does not terminate... even after hours. This time I _CAN_ _REPRODUCE_ the issue in our staging system !! Would it help to run the database server with debug symbols ? That would give more information in the stack traceback ... Tom, could you give us an advise for what we should look for ? How we should continue in tracing the bug down ? In case you care, here is the query as well: select cel.objid, cel.shortname, ser.shortname, con.isInventoryFlag, dev.objid, dty.objid, loc.objid, dev.shortname, loc.shortname, loc.name, man.shortName, dty.type, dev.serNr, dev.invNr, dev.component, loc.zip, loc.city, con.shortname, con.name, CASE WHEN CURRENT_DATE between loc.locationstateFrom and loc.locationstateTo THEN loc.locationstateText WHEN CURRENT_DATE between loc.locationstate2From and loc.locationstate2To THEN loc.locationstate2Text ELSE null END , CASE WHEN CURRENT_DATE between loc.locationstateFrom and loc.locationstateTo THEN loc.locationstateColorCode WHEN CURRENT_DATE between loc.locationstate2From and loc.locationstate2To THEN loc.locationstate2ColorCode ELSE null END , loc.locationcategory, loc.province from contracts con, contractelements cel left outer join servicetypes ser on ser.objid = cel.ser_objid, devices dev, locations loc, devicetypes dty, bpartners man where con.ccu_objid in (3837149,3837320,3837375,3974872,5032031,7220313,15684899,16052168,19196573,20095712,22399813,42092332,42092454,42092574,42886947,43813234) and cel.con_objid = con.objid and dev.cel_objid = cel.objid and loc.objid = dev.loc_objid and dty.objid = dev.dty_objid and man.objid = dty.bpa_objid and upper(cel.isActiv) = 'Y' and upper(coalesce(dev.isActiv,'Y')) = 'Y' and upper(coalesce(dev.IsCommittedSP,'Y')) = 'Y' and upper(coalesce(dev.IsCommittedCust,'Y')) = 'Y' and upper(loc.shortName) = '5195' order by 18 ASC -philipp -- : Dipl-Ing Philipp Reisner Tel +43-1-8178292-50 : : LINBIT Information Technologies GmbH Fax +43-1-8178292-82 : : Schönbrunnerstr 244, 1120 Vienna, Austria http://www.linbit.com : ---------------------------(end of broadcast)--------------------------- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly