When planning queries with a large IN expression in the WHERE clause, the planner transforms the IN list into a scalar array expression. In clause_selectivity(), we estimate the selectivity of the ScalarArrayExpr by calling scalararraysel(), which in turn estimates the selectivity of *each* array element in order to determine the selectivity of the array expression as a whole.
This is quite inefficient when the IN list is large. In a test case that someone sent me privately, a simple query involving two cheap joins and a ~1800 element IN list in the WHERE clause requires about 100ms to plan but only ~10 ms to execute -- about 85% of the total runtime is spent in scalararraysel(). (I'd include the profiling data, but KCacheGrind seems stubbornly opposed to providing a textual summary of its results...) Clearly, the current approach is fine when the array is small -- perhaps for arrays above a certain number of elements, we could switch to randomly sampling array elements, estimating their selectivities, and then using that information to infer the estimated selectivity of the entire array expression. That seems fairly crude, though: does anyone have any better ideas? -Neil ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match