>
>
> This version of the patch wouldn't fall back to the normal ANALYZE
> processing anymore, so this documentation should be updated as such.
> Also, as I think it's the user's responsibility to ensure the existing
> statistics are up-to-date, as I said before, I think we should add a
> note about that here.  Also, as some users wouldn't be able to ensure
> it, I'm wondering if the default should be false.
>

I agree, if there is no fallback, then the default should be false. When I
was initially brainstorming this patch, Nathan Bossart had suggested
enabling it by default because 1) that would be an automatic benefit to
users and 2) the cost of attempting to import stats was small compared to
taking a table sample, so it was worth the attempt. I still want users to
get that automatic benefit, but if there is no fallback to sampling then
the default only makes sense as false.


> If the user wasn't able to ensure the statistics are up-to-date, I
> think he/she might want to do remote ANALYZE *before* fetching the
> statistics, not after trying to do so.  So I think we could instead
> provide this option as such.  What do you think about that?  Anyway,
> I'd vote for leaving this for another patch, as I think it's a
> nice-to-have option rather than a must-have one, as I said before.
>

I hadn't thought of that one. I think it has some merit, but I think we'd
want the try-after case as well. So the settings would be: "off" (never
remote analyze, default), "on" (always analyze before fetching), and
"retry" (analyze only if no stats were found). This option feeds into the
same thinking as the default setting: make the feature do the usually
smart thing, and do it automatically. Building it in pieces might be
easier to get committed, but it takes away the dream of seamless automatic
improvement, and locks in defaults that are hard to change in future
versions. That dream was probably unrealistic, but I wanted to try.
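For concreteness, here is a rough sketch of how those three option values
could be represented and parsed on the FDW side. The names
RemoteAnalyzeMode, parse_remote_analyze_mode, and the option name
"remote_analyze" are placeholders of mine, not anything from the patch:

#include "postgres.h"

typedef enum RemoteAnalyzeMode
{
    REMOTE_ANALYZE_OFF,         /* never issue remote ANALYZE (default) */
    REMOTE_ANALYZE_ON,          /* always ANALYZE remotely before fetching */
    REMOTE_ANALYZE_RETRY        /* ANALYZE remotely only if no stats found */
} RemoteAnalyzeMode;

static RemoteAnalyzeMode
parse_remote_analyze_mode(const char *value)
{
    if (pg_strcasecmp(value, "off") == 0)
        return REMOTE_ANALYZE_OFF;
    if (pg_strcasecmp(value, "on") == 0)
        return REMOTE_ANALYZE_ON;
    if (pg_strcasecmp(value, "retry") == 0)
        return REMOTE_ANALYZE_RETRY;

    ereport(ERROR,
            (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
             errmsg("invalid value for option \"%s\": \"%s\"",
                    "remote_analyze", value)));
    return REMOTE_ANALYZE_OFF;  /* keep compiler quiet */
}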


>
> ISTM that the code is well organized overall.  Here are a few comments:
>

Thanks!


>
> We don't use relallvisible and relallfrozen for foreign tables (note
> that do_analyze_rel() calls vac_update_relstats() with relallvisible=0
> and relallfrozen=0 for them).  Do we really need to retrieve (and
> restore) them?
>

No, and as you stated, we wouldn't want to. The query was lifted verbatim
from pg_dump, with a vague hope of moving the queries to a common library
that both pg_dump and postgres_fdw could draw upon. But that no longer
makes sense, so I'll fix.


> +static const char *attstats_query_17 =
> +   "SELECT DISTINCT ON (s.attname) attname, s.null_frac, s.avg_width, "
> +   "s.n_distinct, s.most_common_vals, s.most_common_freqs, "
> +   "s.histogram_bounds, s.correlation, s.most_common_elems, "
> +   "s.most_common_elem_freqs, s.elem_count_histogram, "
> +   "s.range_length_histogram, s.range_empty_frac,
> s.range_bounds_histogram "
> +   "FROM pg_catalog.pg_stats AS s "
> +   "WHERE s.schemaname = $1 AND s.tablename = $2 "
> +   "ORDER BY s.attname, s.inherited DESC";
>
> I think we should retrieve the attribute statistics for only the
> referenced columns of the remote table, not all the columns of it, to
> reduce the data transfer and the cost of matching local/remote
> attributes in import_fetched_statistics().
>

I thought about this, and decided that we wanted to 1) avoid per-column
round trips and 2) keep the remote queries simple. It's not such a big deal
to add "AND s.attname = ANY($3)" and construct a '{att1,att2,"ATt3"}'
string, as we already do in pg_dump in a few places.
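
To put a little more detail on that, here is a sketch of how the
referenced column names could be packed into a text[] literal to pass as
$3. The helper name build_attname_array is mine, and the actual patch may
well do this differently:

#include "postgres.h"
#include "lib/stringinfo.h"
#include "nodes/pg_list.h"

/*
 * Build a text[] literal such as {"att1","att2","ATt3"} from a List of
 * column name strings, double-quoting every element and escaping embedded
 * quotes and backslashes so mixed-case and odd names survive the trip.
 */
static char *
build_attname_array(List *attnames)
{
    StringInfoData buf;
    ListCell   *lc;
    bool        first = true;

    initStringInfo(&buf);
    appendStringInfoChar(&buf, '{');
    foreach(lc, attnames)
    {
        const char *attname = (const char *) lfirst(lc);

        if (!first)
            appendStringInfoChar(&buf, ',');
        first = false;

        appendStringInfoChar(&buf, '"');
        for (const char *p = attname; *p; p++)
        {
            if (*p == '"' || *p == '\\')
                appendStringInfoChar(&buf, '\\');
            appendStringInfoChar(&buf, *p);
        }
        appendStringInfoChar(&buf, '"');
    }
    appendStringInfoChar(&buf, '}');

    return buf.data;
}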


>
> In fetch_remote_statistics()
>
> +   if (server_version_num >= 180000)
> +       relation_sql = relstats_query_18;
> +   else if (server_version_num >= 140000)
> +       relation_sql = relstats_query_14;
> +   else
> +       relation_sql = relstats_query_default;
>
> I think that having the definition for each of relstats_query_18,
> relstats_query_14 and relstats_query_default makes the code redundant
> and the maintenance hard.  To avoid that, how about building the
> relation_sql query dynamically as done for the query to fetch all
> table data from the remote server in postgresImportForeignSchema().
> Same for the attribute_sql query.
>

It may be a moot point. If we're not fetching relallfrozen, then the 14
and 18 cases become the same. The pre-14 case only exists to differentiate
analyzed from unanalyzed tables; we would map an unanalyzed table to 0 if
we kept those stats, but we almost never would, because an unanalyzed
remote table would not have the attribute stats necessary to qualify as a
good remote fetch. So we're down to just one static query.
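
For what it's worth, here is my guess at the shape of that one remaining
query with relallvisible/relallfrozen dropped; the exact column list and
variable name in the patch may differ:

static const char *relstats_query =
    "SELECT c.relpages, c.reltuples "
    "FROM pg_catalog.pg_class AS c "
    "JOIN pg_catalog.pg_namespace AS n ON n.oid = c.relnamespace "
    "WHERE n.nspname = $1 AND c.relname = $2";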


>
> That's all I have for now.  I will continue to review the changes.
>

Much appreciated.
