Re: UUID v7

2024-01-18 Thread Lukas Fittl
On Thu, Jan 18, 2024 at 5:18 AM Andrey Borodin  wrote:

> Current patch version attached. I've addressed all other requests:
> function renames, aliases, multiple functions instead of optional params,
> cleaner catalog definitions, not throwing error when [var,ver,time] value
> is unknown.
> What is left: deal with timezones, improve documentation.
>

I've done a test of the v10 patch, and ran into an interesting behavior
when passing in a timestamp to the function (which, as a side note, is
actually very useful to have as a feature, to support creating time-based
range partitions on UUIDv7 fields):

postgres=# SELECT uuid_extract_time(uuidv7());
     uuid_extract_time
----------------------------
 2024-01-18 18:49:00.01-08
(1 row)

postgres=# SELECT uuid_extract_time(uuidv7('2024-04-01'));
   uuid_extract_time
------------------------
 2024-04-01 00:00:00-07
(1 row)

postgres=# SELECT uuid_extract_time(uuidv7());
   uuid_extract_time
------------------------
 2024-04-01 00:00:00-07
(1 row)

Note how calling the uuidv7 function again, after having called it with a
fixed future timestamp, returns the future timestamp, even though it should
return the current time.

I believe this is caused by incorrectly re-using the cached
previous_timestamp. In the second call here (with a fixed future
timestamp), we end up setting ts and tms to 2024-04-01, with
increment_counter = false, which leads us to set previous_timestamp to the
passed in timestamp (else branch of the second if in uuidv7). When we then
call the function again without an argument, we end up getting a new
timestamp from gettimeofday, but because we try to detect backwards leaps,
we set increment_counter to true, and thus end up reusing the previous
(future) timestamp here:

/* protection from leap backward */
tms = previous_timestamp;
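
To illustrate, here is my rough reading of the relevant flow (simplified,
with approximate names -- not the actual patch code):

/* sketch only, simplified from my reading of v10 */
if (ts_supplied)
    ts = user_ts;               /* e.g. the fixed '2024-04-01' value */
else
    gettimeofday(&ts, NULL);    /* current time */
tms = timestamp_to_ms(ts);      /* placeholder for the real conversion */

/* a backwards leap relative to the cached value triggers the counter */
increment_counter = (tms <= previous_timestamp);

if (increment_counter)
{
    /* protection from leap backward */
    tms = previous_timestamp;   /* reuses the cached (future) timestamp */
}
else
    previous_timestamp = tms;   /* a passed-in future value gets cached here */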

Not sure how to fix this, but clearly something is amiss here.

Thanks,
Lukas

-- 
Lukas Fittl


Re: Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2023-07-06 Thread Lukas Fittl
On Thu, Jul 6, 2023 at 12:56 AM Daniel Gustafsson  wrote:

> Lukas: do you have an updated patch for this commitfest to address David's
> comments?
>

I have a draft - I should be able to post an updated patch in the next
days. Thanks for checking!

Thanks,
Lukas

-- 
Lukas Fittl


Add estimated hit ratio to Memoize in EXPLAIN to explain cost adjustment

2023-03-04 Thread Lukas Fittl
Hi,

I was debugging a planner problem on Postgres 14.4 the other day - and the
involved "bad" plan was including Memoize - though I don't necessarily
think that Memoize is to blame (and this isn't any of the problems recently
fixed in Memoize costing).

However, what I noticed whilst trying different ways to fix the plan is
that the Memoize output was a bit hard to reason about - especially since
the plan involving Memoize was expensive to run, and so I was mostly
running EXPLAIN without ANALYZE to look at the costing.

Here is an example of the output I was looking at:

 ->  Nested Loop  (cost=1.00..971672.56 rows=119623 width=0)
   ->  Index Only Scan using table1_idx on table1
(cost=0.43..372676.50 rows=23553966 width=8)
   ->  Memoize  (cost=0.57..0.61 rows=1 width=8)
 Cache Key: table1.table2_id
 Cache Mode: logical
 ->  Index Scan using table2_idx on table2
(cost=0.56..0.60 rows=1 width=8)
   Index Cond: (id = table1.table2_id)

The other plan I was comparing with (that I wanted the planner to choose
instead), had a total cost of 1,451,807.35 -- and so I was trying to figure
out why the Nested Loop was costed as 971,672.56.

Simple math makes me expect the Nested Loop should roughly have a total
cost of 14,740,595.76 here (372,676.50 + 23,553,966 * 0.61), ignoring a lot
of the smaller costs. Thus, in this example, it appears Memoize made the
plan cost significantly cheaper (roughly 6% of the regular cost).

Essentially this comes down to the "cost reduction" performed by Memoize
only being implicitly visible in the Nested Loop's total cost - and with
nothing useful on the Memoize node itself - since the rescan costs are not
shown.

I think explicitly adding the estimated cache hit ratio for Memoize nodes
might make this easier to reason about, like this:

->  Memoize  (cost=0.57..0.61 rows=1 width=8)
 Cache Key: table1.table2_id
 Cache Mode: logical
 Cache Hit Ratio Estimated: 0.94

Alternatively (or in addition) we could consider showing the "ndistinct"
value that is calculated in cost_memoize_rescan - since that's the most
significant contributor to the cache hit ratio (and you can influence that
directly by improving the ndistinct statistics).
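
For reference, my rough reading of how cost_memoize_rescan() arrives at the
hit ratio (paraphrased from memory, the actual code has more detail):

/* paraphrased sketch, not the literal costing code */
ndistinct = estimate_num_groups(root, param_exprs, calls, NULL, NULL);
est_cache_entries = floor(hash_mem_bytes / est_entry_bytes);

/*
 * The first lookup of each distinct key always misses; if not all keys
 * fit into the cache, scale down further by the fraction that does.
 */
hit_ratio = ((calls - ndistinct) / calls) *
            (est_cache_entries / Max(ndistinct, est_cache_entries));

This is why ndistinct dominates the result, and why surfacing either value
in EXPLAIN would help when debugging plans like the one above.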

See attached a patch that implements showing the cache hit ratio as a
discussion starter.

I'll park this in the July commitfest for now.

Thanks,
Lukas

-- 
Lukas Fittl


v1-0001-Add-Estimated-Cache-Hit-Ratio-for-Memoize-plan-no.patch
Description: Binary data


Re: Sampling-based timing for EXPLAIN ANALYZE

2023-01-17 Thread Lukas Fittl
On Fri, Jan 6, 2023 at 1:19 AM Jelte Fennema  wrote:

> Nice addition! And the code looks pretty straight forward.
>

Thanks for reviewing!

> The current patch triggers warnings:
> https://cirrus-ci.com/task/6016013976731648 Looks like you need to add
> void as the argument.
>

Fixed in v2 attached. This also adds a simple regression test, as well as
fixes the parallel worker handling.

> Do you have some performance comparison between TIMING ON and TIMING
> SAMPLING?
>

Here are some benchmarks of auto_explain overhead on my ARM-based M1
Macbook for the following query run with pgbench on a scale factor 100 data
set:

SELECT COUNT(*) FROM pgbench_branches JOIN pgbench_accounts USING (bid)
JOIN pgbench_tellers USING (bid) WHERE bid = 42;

(the motivation is to use a query that is more complex than the standard
pgbench select-only test query)

avg latency (best of 3), -T 300, -c 4, -s 100, shared_buffers 2GB, fsync
off, max_parallel_workers_per_gather 0:

master, log_timing = off: 871 ms (878 / 877 / 871)
patch, log_timing = off: 869 ms (882 / 880 / 869)
patch, log_timing = on: 890 ms (917 / 930 / 890)
patch, log_timing = sampling, samplefreq = 1000: 869 ms (887 / 869 / 894)

Additionally, here is Andres' benchmark from [1], with the sampling option
added:

%  psql -Xc 'DROP TABLE IF EXISTS t; CREATE TABLE t AS SELECT * FROM
generate_series(1, 10) g(i);' postgres && pgbench -n -r -t 100 -f
<(echo -e "SELECT COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING OFF) SELECT
COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING SAMPLING) SELECT COUNT(*) FROM
t;EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;") postgres |grep '^ '
DROP TABLE
SELECT 10
 3.507   0  SELECT COUNT(*) FROM t;
 3.476   0  EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*)
FROM t;
 3.576   0  EXPLAIN (ANALYZE, TIMING SAMPLING) SELECT
COUNT(*) FROM t;
 5.096   0  EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*)
FROM t;

My pg_test_timing data for reference:

% pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 23.65 ns
Histogram of timing durations:
   < us   % of total      count
      1     97.64472  123876325
      2      2.35421    2986658
      4      0.00022        277
      8      0.00016        202
     16      0.00064        815
     32      0.00005         64

> In InstrStartSampling there's logic to increase/decrease the frequency of
> an already existing timer. It's not clear to me when this can occur. I'd
> expect sampling frequency to remain constant throughout an explain plan. If
> it's indeed needed, I think a code comment would be useful to explain why
> this edge case is necessary.
>

Clarified in a code comment in v2. This is needed for handling nested
statements which could have different sampling frequencies for each nesting
level, i.e. a function might want to sample its queries at a higher
frequency than its caller.

Thanks,
Lukas

[1] https://postgr.es/m/20230116213913.4oseovlzvc2674z7%40awork3.anarazel.de

-- 
Lukas Fittl


v2-0001-Add-TIMING-SAMPLING-option-for-EXPLAIN-ANALYZE.patch
Description: Binary data


Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2023-01-02 Thread Lukas Fittl
Hi David,

Thanks for continuing to work on this patch, and my apologies for silence
on the patch.

It's been hard to make time, and especially so because I typically develop
on an ARM-based macOS system where I can't test this directly - hence my
tests with virtualized EC2 instances, where I ran into the timing oddities.

On Mon, Jan 2, 2023 at 5:28 AM David Geier  wrote:

> The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
> return double. This seems error prone. What about renaming the function
> or also have the function return a double and cast where necessary at
> the call site?
>

Minor note, but in my understanding using a uint64 (where we can) is faster
for any simple arithmetic we do with the values.


> If no one objects I would also re-register this patch in the commit fest.
>

+1, and feel free to carry this patch forward - I'll try to make an effort
to review my earlier testing issues again, as well as your later
improvements to the patch.

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE
with timing over in [0], using a sampling-based approach to reduce the
timing overhead.

[0]:
https://www.postgresql.org/message-id/CAP53PkxXMk0j-%2B0%3DYwRti2pFR5UB2Gu4v2Oyk8hhZS0DRART6g%40mail.gmail.com

Thanks,
Lukas

-- 
Lukas Fittl


Sampling-based timing for EXPLAIN ANALYZE

2023-01-02 Thread Lukas Fittl
Hi,

Since EXPLAIN ANALYZE with TIMING ON still carries noticeable overhead on
modern hardware (despite time sources being faster), I'd like to propose a
new setting for EXPLAIN ANALYZE, called "TIMING SAMPLING", as compared to
TIMING ON.

This new timing mode uses a timer on a fixed recurring frequency (e.g. 100
or 1000 Hz) to gather a sampled timestamp on a predefined schedule, instead
of getting the time on-demand when InstrStartNode/InstrStopNode is called.
To implement the timer, we can use the existing timeout infrastructure,
which is backed by a wall clock timer (ITIMER_REAL).

Conceptually this is inspired by how sampling profilers work (e.g. "perf"),
but it ties into the existing per-plan node instrumentation done by EXPLAIN
ANALYZE, and simply provides a lower accuracy version of the total time for
each plan node.
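
As a rough sketch of the mechanism (simplified, not the actual patch code),
the timer callback is the only place that reads the clock, and the per-node
instrumentation just reuses the most recent sample:

static instr_time sampled_now;      /* refreshed by the timeout handler */

static void
sampling_timeout_handler(void)
{
    /* the only clock read, fired at samplefreq (e.g. every 1ms) */
    INSTR_TIME_SET_CURRENT(sampled_now);
}

/* InstrStartNode / InstrStopNode then use the cached value, roughly: */
instr->starttime = sampled_now;
...
INSTR_TIME_ACCUM_DIFF(instr->counter, sampled_now, instr->starttime);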

In EXPLAIN output this is marked as "sampled time", and scaled to the total
wall clock time (to adjust for the sampling undercounting):

=# EXPLAIN (ANALYZE, BUFFERS, TIMING SAMPLING, SAMPLEFREQ 100) SELECT ...;
                                  QUERY PLAN
------------------------------------------------------------------------------
 HashAggregate  (cost=201747.90..201748.00 rows=10 width=12) (actual
sampled time=5490.974 rows=9 loops=1)
   ...
   ->  Hash Join  (cost=0.23..199247.90 rows=49 width=4) (actual
sampled time=3738.619 rows=900 loops=1)
 ...
 ->  Seq Scan on large  (cost=0.00..144247.79 rows=979 width=4)
(actual sampled time=1004.671 rows=1001 loops=1)
   ...
 ->  Hash  (cost=0.10..0.10 rows=10 width=4) (actual sampled
time=0.000 rows=10 loops=1)
   ...
 Execution Time: 5491.475 ms
---

In simple query tests like this on my local machine, this shows a
consistent benefit over TIMING ON (and behaves close to ANALYZE with TIMING
OFF), whilst providing a "good enough" accuracy to identify which part of
the query was problematic.

Attached is a prototype patch for early feedback on the concept, with tests
and documentation to come in a follow-up. Since the January commitfest is
still marked as open I'll register it there, but note that my assumption is
this is *not* Postgres 16 material.

As an open item, note that in the patch the requested sampling frequency is
not yet passed to parallel workers (it always defaults to 1000 Hz when
sampling is enabled). Also, note the timing frequency is limited to a
maximum of 1000 Hz (1ms) due to current limitations of the timeout
infrastructure.

With thanks to Andres Freund for help on refining the idea, collaborating
on early code and finding the approach to hook into the timeout API.

Thanks,
Lukas

-- 
Lukas Fittl


v1-0001-Add-TIMING-SAMPLING-option-for-EXPLAIN-ANALYZE.patch
Description: Binary data


Re: pg_stat_bgwriter.buffers_backend is pretty meaningless (and more?)

2022-09-30 Thread Lukas Fittl
On Tue, Sep 27, 2022 at 11:20 AM Melanie Plageman  wrote:

> v30 attached
> rebased and pgstat_io_ops.c builds with meson now
> also, I tested with pgstat_report_stat() only flushing when forced and
> tests still pass
>

First of all, I'm excited about this patch, and I think it will be a big
help to understand better which part of Postgres is producing I/O (and why).

I've paired up with Maciek (CCed) on a review of this patch and had a few
comments, focused on the user experience:

The term "strategy" as an "io_context" is hard to understand, as it's not a
concept an end-user / DBA would be familiar with. Since this comes from
BufferAccessStrategyType (i.e. anything not NULL/BAS_NORMAL is treated as
"strategy"), maybe we could instead split this out into the individual
strategy types? i.e. making "strategy" three different I/O contexts
instead: "shared_bulkread", "shared_bulkwrite" and "shared_vacuum",
retaining "shared" to mean NULL / BAS_NORMAL.

Separately, could we also track buffer hits without incurring extra
overhead? (not just allocs and reads) -- Whilst we already have shared read
and hit counters in a few other places, this would help make the common
"What's my cache hit ratio" question more accurate to answer in the
presence of different shared buffer access strategies. Tracking hits could
also help for local buffers (e.g. to tune temp_buffers based on seeing a
low cache hit ratio).

Additionally, some minor notes:

- Since the stats are counting blocks, it would make sense to prefix the
view columns with "blks_", and word them in the past tense (to match
current style), i.e. "blks_written", "blks_read", "blks_extended",
"blks_fsynced" (realistically one would combine this new view with other
data e.g. from pg_stat_database or pg_stat_statements, which all use the
"blks_" prefix, and stop using pg_stat_bgwriter for this which does not use
such a prefix)

- "alloc" as a name doesn't seem intuitive (and it may be confused with
memory allocations) - whilst this is already named this way in
pg_stat_bgwriter, it feels like this is an opportunity to eventually
deprecate the column there and make this easier to understand -
specifically, maybe we can clarify that this means buffer *acquisitions*?
(either by renaming the field to "blks_acquired", or clarifying in the
documentation)

- Assuming we think this view could realistically cover all I/O produced by
Postgres in the future (thus warranting the name "pg_stat_io"), it may be
best to have an explicit list of things that are not currently tracked in
the documentation, to reduce user confusion (i.e. WAL writes are not
tracked, temporary files are not tracked, and some forms of direct writes
are not tracked, e.g. when a table moves to a different tablespace)

- In the view documentation, it would be good to explain the different
values for "io_strategy" (and what they mean)

- Overall it would be helpful if we had a dedicated documentation page on
I/O statistics that's linked from the pg_stat_io view description, and
explains how the I/O statistics tie into the various concepts of shared
buffers / buffer access strategies / etc (and what is not tracked today)

Thanks,
Lukas

-- 
Lukas Fittl


Re: pg_get_constraintdef: Schema qualify foreign tables unless pretty printing is enabled

2022-08-09 Thread Lukas Fittl
On Tue, Aug 9, 2022 at 5:33 PM Tom Lane  wrote:

> I would say that pg_get_indexdef is the one that's out of step.
> I count 11 calls of generate_relation_name in ruleutils.c,
> of which only three have this business of being overridden
> when not-pretty.  What is the rationale for that, and why
> would we move pg_get_constraintdef from one category to the
> other?
>

The overall motivation here is to make it easy to recreate the schema
without having to match the search_path on the importing side to be
identical to the exporting side. There is a workaround, which is to do a
SET search_path before calling these functions that excludes the referenced
schemas (which I guess is what pg_dump does?).

But I wonder, why do we have an explicit pretty printing flag on these
functions, and PRETTYFLAG_SCHEMA in the code to represent this behavior. If
we don't want pretty printing to affect schema qualification, why does that
flag exist?

Of the other call sites, in terms of using generate_relation_name vs
generate_qualified_relation_name:

* pg_get_triggerdef_worker makes it conditional on pretty=true, but only
for ON, not the FROM (not clear why that difference exists?)
* pg_get_indexdef_worker makes it conditional on prettyFlags &
PRETTYFLAG_SCHEMA for the ON
* pg_get_statisticsobj_worker does not handle pretty printing (always uses
generate_relation_name)
* make_ruledef makes it conditional on prettyFlags & PRETTYFLAG_SCHEMA for
the TO
* get_insert_query_def does not handle pretty printing (always uses
generate_relation_name)
* get_update_query_def does not handle pretty printing (always uses
generate_relation_name)
* get_delete_query_def does not handle pretty printing (always uses
generate_relation_name)
* get_rule_expr does not handle pretty printing (always uses
generate_relation_name)
* get_from_clause_item does not handle pretty printing (always uses
generate_relation_name)

Looking at that, it seems we didn't make the effort for the view related
code with all its complexity, and didn't do it for pg_get_statisticsobjdef
since it doesn't have a pretty flag. Why we didn't do it in
pg_get_triggerdef_worker for FROM isn't clear to me.

If we want to be entirely consistent (and keep supporting
PRETTYFLAG_SCHEMA), that probably means:

* Adding a pretty flag to pg_get_statisticsobjdef
* Teaching get_query_def to pass down prettyFlags to get_*_query_def
functions
* Update pg_get_triggerdef_worker to handle pretty for FROM as well

If that seems like a sensible direction I'd be happy to work on a patch.

Thanks,
Lukas

-- 
Lukas Fittl


pg_get_constraintdef: Schema qualify foreign tables unless pretty printing is enabled

2022-08-09 Thread Lukas Fittl
Hi hackers,

Whilst debugging an issue with the output of pg_get_constraintdef, we've
discovered that pg_get_constraintdef doesn't schema qualify foreign tables
mentioned in the REFERENCES clause, even if pretty printing
(PRETTYFLAG_SCHEMA) is turned off.

This is a problem because it means there is no way to get a constraint
definition that can be recreated on another system when multiple schemas
are in use, but a different search_path is set. It's also different from
pg_get_indexdef, where this flag is correctly respected.

I assume this is an oversight, since the fix is pretty straightforward, see
attached patch. I'll register the patch for the next commitfest.

Here is a test case from my colleague Maciek showing this difference:

create schema s;
create table s.foo(a int primary key);
create table s.bar(a int primary key, b int references s.foo(a));

select pg_get_indexdef(indexrelid, 0, false) from pg_index order by
indexrelid desc limit 3;

                              pg_get_indexdef
----------------------------------------------------------------------------
 CREATE UNIQUE INDEX bar_pkey ON s.bar USING btree (a)
 CREATE UNIQUE INDEX foo_pkey ON s.foo USING btree (a)
 CREATE UNIQUE INDEX pg_toast_13593_index ON pg_toast.pg_toast_13593 USING
btree (chunk_id, chunk_seq)
(3 rows)

select pg_get_constraintdef(oid, false) from pg_constraint order by oid
desc limit 3;
   pg_get_constraintdef
---
 FOREIGN KEY (b) REFERENCES foo(a)
 PRIMARY KEY (a)
 PRIMARY KEY (a)
(3 rows)

Thanks,
Lukas

-- 
Lukas Fittl
From 83a1ab6081f70d0eed820b2b13bc3ab6e93af8ec Mon Sep 17 00:00:00 2001
From: Lukas Fittl 
Date: Tue, 9 Aug 2022 16:55:35 -0700
Subject: [PATCH v1] pg_get_constraintdef: Schema qualify foreign tables by
 default

This matches pg_get_constraintdef to behave the same way as
pg_get_indexdef, which is to schema qualify referenced objects in the
definition, unless pretty printing is explicitly requested.

For pretty printing the previous behaviour is retained, which is to only
include the schema information if the referenced object is not in the
current search path.
---
 src/backend/utils/adt/ruleutils.c | 5 +++--
 src/test/regress/expected/foreign_key.out | 8 
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/src/backend/utils/adt/ruleutils.c b/src/backend/utils/adt/ruleutils.c
index d575aa0066..b7a2a356b4 100644
--- a/src/backend/utils/adt/ruleutils.c
+++ b/src/backend/utils/adt/ruleutils.c
@@ -2249,8 +2249,9 @@ pg_get_constraintdef_worker(Oid constraintId, bool fullCommand,
 
 /* add foreign relation name */
appendStringInfo(&buf, ") REFERENCES %s(",
- generate_relation_name(conForm->confrelid,
-		NIL));
+ (prettyFlags & PRETTYFLAG_SCHEMA) ?
+ generate_relation_name(conForm->confrelid, NIL) :
+ generate_qualified_relation_name(conForm->confrelid));
 
 /* Fetch and build referenced-column list */
 val = SysCacheGetAttr(CONSTROID, tup,
diff --git a/src/test/regress/expected/foreign_key.out b/src/test/regress/expected/foreign_key.out
index da26f083bc..a70f7c2491 100644
--- a/src/test/regress/expected/foreign_key.out
+++ b/src/test/regress/expected/foreign_key.out
@@ -775,10 +775,10 @@ CREATE TABLE FKTABLE (
   FOREIGN KEY (tid, fk_id_del_set_default) REFERENCES PKTABLE ON DELETE SET DEFAULT (fk_id_del_set_default)
 );
 SELECT pg_get_constraintdef(oid) FROM pg_constraint WHERE conrelid = 'fktable'::regclass::oid ORDER BY oid;
-pg_get_constraintdef
-
- FOREIGN KEY (tid, fk_id_del_set_null) REFERENCES pktable(tid, id) ON DELETE SET NULL (fk_id_del_set_null)
- FOREIGN KEY (tid, fk_id_del_set_default) REFERENCES pktable(tid, id) ON DELETE SET DEFAULT (fk_id_del_set_default)
+   pg_get_constraintdef
+---
+ FOREIGN KEY (tid, fk_id_del_set_null) REFERENCES public.pktable(tid, id) ON DELETE SET NULL (fk_id_del_set_null)
+ FOREIGN KEY (tid, fk_id_del_set_default) REFERENCES public.pktable(tid, id) ON DELETE SET DEFAULT (fk_id_del_set_default)
 (2 rows)
 
 INSERT INTO PKTABLE VALUES (1, 0), (1, 1), (1, 2);
-- 
2.34.0



Re: log_line_prefix: make it possible to add the search_path

2022-07-25 Thread Lukas Fittl
On Mon, Jul 25, 2022 at 12:38 AM Pierre Ducroquet  wrote:

> usecase by not showing the schema, one of them being log_line_prefix.
> It is possible to work around this using the application_name, but a
> mistake
> on the application side would be fatal, while the search_path would still
> indicate the real tables used in a query.
>

I'm assuming this is mostly referring to STATEMENT log lines and other
situations where the original query is output (e.g. auto_explain).

+1 on the benefit of solving this (I've had this use case before), but I
think we can keep this more specific than a general log_line_prefix option.
The search_path isn't relevant to any log line that doesn't reference a
query, since e.g. autovacuum log output fully qualifies its relation names,
and many other common log lines have nothing to do with tables or queries.

What if we instead had something like this, as an extra CONTEXT (or DETAIL)
log line:

LOG: duration: 4079.697 ms execute :
SELECT * FROM x WHERE y = $1 LIMIT $2
DETAIL: parameters: $1 = 'long string', $2 = '1'
CONTEXT: settings: search_path = 'my_tenant_schema, "$user", public'

That way you could determine that the slow query was affecting the "x"
table in "my_tenant_schema".

This log output would be controlled by a new GUC, e.g.
"log_statement_search_path" with three settings: (1) never, (2)
non_default, (3) always.

The default would be "never" (same as today). "non_default" would output
the search path when a SET has modified it in the current session (and so
we couldn't infer it from the config or the role/database overrides).
"always" would always output the search path for statement-related log
lines.

Thanks,
Lukas

-- 
Lukas Fittl


Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

2022-07-01 Thread Lukas Fittl
On Fri, Jun 12, 2020 at 4:28 PM Andres Freund  wrote:

> The attached patches are really just a prototype. I'm also not really
> planning to work on getting this into a "production ready" patchset
> anytime soon. I developed it primarily because I found it the overhead
> made it too hard to nail down in which part of a query tree performance
> changed.  If somebody else wants to continue from here...
>
> I do think it's be a pretty significant improvement if we could reduce
> the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
> bunch of low-level code.
>

Based on an off-list conversation with Andres, I decided to dust off this
old patch for using rdtsc directly. The significant EXPLAIN ANALYZE
performance improvements (especially when using rdtsc instead of rdtsc*p*)
seem to warrant giving this a more thorough look.

See attached an updated patch (adding it to the July commitfest), with a few
changes:

- Keep using clock_gettime() as a fallback if we decide to not use rdtsc
- Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work
- The decision to use rdtsc (or not) is made at runtime, in the new
  INSTR_TIME_INITIALIZE() -- we can't make this decision at compile time
  because this is dependent on the specific CPU in use, amongst other things
- In an abundance of caution, for now I've decided to only enable this if we
  are on Linux/x86, and the current kernel clocksource is TSC (the kernel
  has quite sophisticated logic around making this decision, see [1])

Note that if we implemented the decision logic ourselves (instead of relying
on the Linux kernel), I'd be most worried about older virtualization
technology. In my understanding getting this right is notably more
complicated than just checking cpuid, see [2].
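
To make the runtime check concrete, here is roughly what the initialization
boils down to (simplified sketch; tsc_frequency_ghz() is a stand-in for the
cpuid(0x16) / /proc/cpuinfo lookup, not a real function in the patch):

/* simplified sketch, not the actual patch code */
static bool   use_rdtsc = false;
static double ns_per_cycle = 1.0;

void
INSTR_TIME_INITIALIZE(void)
{
    char  buf[32] = {0};
    FILE *f;

    /* only use rdtsc if the kernel itself selected TSC as the clocksource */
    f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
    if (f != NULL)
    {
        if (fgets(buf, sizeof(buf), f) != NULL && strncmp(buf, "tsc", 3) == 0)
            use_rdtsc = true;
        fclose(f);
    }

    if (use_rdtsc)
        ns_per_cycle = 1.0 / tsc_frequency_ghz();   /* GHz ~= cycles per ns */
}

Everything else keeps calling clock_gettime() as before when use_rdtsc ends
up false.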

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
  the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
  reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
  higher for \timing than for the EXPLAIN ANALYZE time reported on the
  server side, only when rdtsc measurement is used -- it's likely there is
  a problem somewhere with how we perform the cycles to time conversion
* Possibly related, the floating point handling for the cycles_to_sec
  variable is problematic in terms of precision (see FIXME, taken over from
  Andres' POC)

Open questions from me:

1) Do we need to account for different TSC offsets on different CPUs in SMP
   systems? (the Linux kernel certainly has logic to that extent, but [3]
   suggests this is no longer a problem on Nehalem and newer chips, i.e.
   those having an invariant TSC)

2) Should we have a setting "--with-tsc" for configure? (instead of always
   enabling it when on Linux/x86 with a TSC clocksource)

3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
   current instructions to finish -- the prior discussion seemed to suggest
   we don't want it for node instruction measurements, but possibly we do
   want this in other cases?)

4) Should we support using the "mrs" instruction on ARM? (which is similar
   to rdtsc, see [4])

Thanks,
Lukas

[1] https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c
[2] http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/
[3] https://stackoverflow.com/a/11060619/1652607
[4] https://cpufun.substack.com/p/fun-with-timers-and-cpuid

-- 
Lukas Fittl


v2-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patch
Description: Binary data


v2-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patch
Description: Binary data


Re: shared-memory based stats collector - v70

2022-04-06 Thread Lukas Fittl
On Wed, Apr 6, 2022 at 12:34 PM Justin Pryzby  wrote:

> On Wed, Apr 06, 2022 at 12:27:34PM -0700, Andres Freund wrote:
> > > > +   next use of statistical information will cause a new snapshot to
> be built
> > > > +   or accessed statistics to be cached.
> > >
> > > I believe this should be an "and", not an "or". (next access builds
> both a
> > > new snapshot and caches accessed statistics)
> >
> > I *think* or is correct? The new snapshot is when
> stats_fetch_consistency =
> > snapshot, the cached is when stats_fetch_consistency = cache. Not sure
> how to
> > make that clearer without making it a lot longer. Suggestions?
>
> I think it's correct.  Maybe it's clearer to say:
>
> +   next use of statistical information will (when in snapshot mode) cause
> a new snapshot to be built
> +   or (when in cache mode) accessed statistics to be cached.
>

Ah, yes, that does clarify what was meant.

+1 to Justin's edit, or something like it.

Thanks,
Lukas

-- 
Lukas Fittl


Re: shared-memory based stats collector - v70

2022-04-06 Thread Lukas Fittl
On Tue, Apr 5, 2022 at 8:00 PM Andres Freund  wrote:

> Here comes v70:
>

Some small nitpicks on the docs:

> From 13090823fc4c7fb94512110fb4d1b3e86fb312db Mon Sep 17 00:00:00 2001
> From: Andres Freund 
> Date: Sat, 2 Apr 2022 19:38:01 -0700
> Subject: [PATCH v70 14/27] pgstat: update docs.
> ...
> diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
> -  These parameters control server-wide statistics collection
features.
> -  When statistics collection is enabled, the data that is produced
can be
> +  These parameters control server-wide cumulative statistics system.
> +  When enabled, the data that is collected can be

Missing "the" ("These parameters control the server-wide cumulative
statistics system").

> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> +   any of the accumulated statistics, acessed values are cached until
the end

"acessed" => "accessed"

> +   stats_fetch_consistency can be set
> +   snapshot, at the price of increased memory usage
for

Missing "to" ("can be set to snapshot")

> +   caching not-needed statistics data.  Conversely, if it's known that
statistics

Double space between "data." and "Conversely" (not sure if that matters)

> +   current transaction's statistics snapshot or cached values (if any).
The

Double space between "(if any)." and "The" (not sure if that matters)

> +   next use of statistical information will cause a new snapshot to be
built
> +   or accessed statistics to be cached.

I believe this should be an "and", not an "or". (next access builds both a
new snapshot and caches accessed statistics)

Thanks,
Lukas

-- 
Lukas Fittl


Re: [PATCH] Add extra statistics to explain for Nested Loop

2022-01-06 Thread Lukas Fittl
On Sun, Nov 21, 2021 at 8:55 PM Justin Pryzby  wrote:

> I'm curious to hear what you and others think of the refactoring.
>
> It'd be nice if there's a good way to add a test case for verbose output
> involving parallel workers, but the output is unstable ...
>

I've reviewed this patch, and it works as expected - the refactoring
changes by Justin also appear to make sense to me.

I've briefly thought about whether this needs documentation (currently the patch
includes none), but there does not appear to be a good place to add
documentation about this from a quick glance, so it seems acceptable to
leave this out given the lack of more detailed EXPLAIN documentation in
general.

The one item that still feels a bit open to me is benchmarking, based on
Andres' comment a while ago:

On Mon, Oct 19, 2020 at 4:20 PM Andres Freund  wrote:

> I'm a bit worried that further increasing the size of struct
> Instrumentation will increase the overhead of EXPLAIN ANALYZE further -
> in some workloads we spend a fair bit of time in code handling that. It
> would be good to try to find a few bad cases, and see what the overhead is.
>

Whilst no specific bad cases were provided, I wonder if even a simple
pgbench with auto_explain (and log_analyze=1) would be a way to test this?

The overhead of the Instrumentation struct size should show regardless of
whether a plan actually includes a Nested Loop.

Thanks,
Lukas

-- 
Lukas Fittl


Re: PostgreSQL 13 Beta 1 Release Announcement Draft

2020-05-19 Thread Lukas Fittl
On Mon, May 18, 2020 at 7:29 PM Jonathan S. Katz  wrote:

> Attached is a draft of the release announcement for the PostgreSQL 13
> Beta 1 release this week.
>

We could call out the additional commits that Tom has done for wait event
renaming, re: compatibility - next to "Rename some recovery-related wait
events".

These are the relevant commits, I think:

Rename SLRU structures and associated LWLocks.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=5da14938f7bfb96b648ee3c47e7ea2afca5bcc4a

Rename assorted LWLock tranches.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=36ac359d3621578cefc2156a3917024cdd3b1829

Drop the redundant "Lock" suffix from LWLock wait event names.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=14a91010912632cae322b06fce0425faedcf7353

Mop-up for wait event naming issues.
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=3048898e73c75f54bb259323382e0e7f6368cb6f

Thanks,
Lukas

-- 
Lukas Fittl


Procedure calls are not tracked in pg_stat_user_functions / track_functions

2018-10-04 Thread Lukas Fittl
Hi all,

It seems that currently procedures do not get tracked when track_functions
is enabled, which means one needs to resort to other workarounds in order
to monitor procedure calls/runtime.

To illustrate:

=# SHOW track_functions;
┌─────────────────┐
│ track_functions │
├─────────────────┤
│ all             │
└─────────────────┘
(1 row)

=# CALL abc();
CALL

=# SELECT def();
┌─────┐
│ def │
├─────┤
│     │
└─────┘
(1 row)

=# SELECT * FROM pg_stat_user_functions;
┌─[ RECORD 1 ]─┬────────┐
│ funcid       │ 75223  │
│ schemaname   │ public │
│ funcname     │ def    │
│ calls        │ 1      │
│ total_time   │ 3.222  │
│ self_time    │ 3.222  │
└──────────────┴────────┘

Was this intentional, or an oversight?

If welcome, I would be happy to work on a patch. Whilst slightly confusing
in terms of naming, we could just track this together with functions, since
one can always join with pg_proc to determine whether something is a
function or a procedure.

Thanks,
Lukas

-- 
Lukas Fittl


Re: Query is over 2x slower with jit=on

2018-10-01 Thread Lukas Fittl
On Tue, Sep 25, 2018 at 1:17 PM, Andres Freund  wrote:
>
> I've pushed the change without that bit - it's just a few additional
> lines if we want to change that.
>

It seems that since this commit, JIT statistics are now only being printed
for parallel query plans. This is due to ExplainPrintJIT being called
before ExecutorEnd, so in a non-parallel query,
queryDesc->estate->es_jit_combined_instr will never get set.

Attached a patch that instead introduces a new ExplainPrintJITSummary
method that summarizes the statistics before they get printed.

I've also removed an (I believe) unnecessary "if (estate->es_instrument)"
check that prevented EXPLAIN without ANALYZE from showing whether JIT would
be used or not.

In addition this also updates a missed section in the documentation with
the new stats output format.

Best,
Lukas

-- 
Lukas Fittl


fix-jit-statistics-in-EXPLAIN-for-non-parallel-queries-v1.patch
Description: Binary data


Re: auto_explain: Include JIT output if applicable

2018-09-24 Thread Lukas Fittl
On Mon, Sep 24, 2018 at 1:48 PM, Andres Freund  wrote:
>
> Thanks for noticing - pushed!
>

Thanks!

Best,
Lukas

-- 
Lukas Fittl


auto_explain: Include JIT output if applicable

2018-09-24 Thread Lukas Fittl
Hi,

Whilst playing around with auto_explain and JIT today, I realized that
auto_explain currently doesn't output JIT information, which is rather
unfortunate when analyzing a larger set of queries in a semi-automated
manner.

Attached a trivial patch that fixes the issue and adds JIT information to
auto_explain with the same logic as used for regular EXPLAIN.

Thanks,
Lukas

-- 
Lukas Fittl


auto_explain-include-jit-output-v1.patch
Description: Binary data


Re: [HACKERS] Planning counters in pg_stat_statements

2018-03-30 Thread Lukas Fittl
On Tue, Jan 23, 2018 at 3:31 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
> > Unfortunately I'm not going to have bandwidth to figure this out
> > during this commitfest due to other priorities so I'm going to call
> > this "returned with feedback".
>
> OK.  There's still time to get it done in the March 'fest.
>

I've had an interesting incident earlier this week where seeing planning
time in pg_stat_statements would have been very helpful to determine the
root cause more quickly.

Essentially the situation was a statement that executed in < 1ms but took
more than 250ms to plan, and ran often - once we found the statement (using
pg_stat_activity sampling) we were able to fix it quickly by rewriting the
query, reducing system load from ~8 to ~1. Having this patch in
pg_stat_statements would have been critical to get the full picture of what
was going on earlier.

Thomas: I'm not as familiar with planner internals as you are, but happy to
try and contribute here. Would it be useful for me to try to adjust the
patch to Tom's proposal?

Best,
Lukas

-- 
Lukas Fittl