Re: insert and query performance on big string table with pg_trgm

2017-11-20 Thread Jeff Janes
On Mon, Nov 20, 2017 at 2:54 PM, Matthew Hall  wrote:

While I have not done exhaustive testing, from the tests I have done I've
never found gist to be better than gin with trgm indexes.


>
> Here is the table:
>
>                          Unlogged table "public.huge_table"
>    Column    |           Type           | Collation | Nullable |                Default
> -------------+--------------------------+-----------+----------+----------------------------------------
>  id          | bigint                   |           | not null | nextval('huge_table_id_seq'::regclass)
>  inserted_ts | timestamp with time zone |           |          | transaction_timestamp()
>  value       | character varying        |           |          |
> Indexes:
>     "huge_table_pkey" PRIMARY KEY, btree (id)
>     "huge_table_value_idx" UNIQUE, btree (value)
>     "huge_table_value_trgm" gin (value gin_trgm_ops)
>

Do you really need the artificial primary key, when you already have
another column that would be used as the primary key?  If you need to use
it as a foreign key in another table, then you very well might.  But
maintaining two unique indexes doesn't come free.
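
If the natural key route works for you, one sketch of making the swap (this
assumes nothing references id as a foreign key; the USING INDEX form reuses
the unique index you already have):

    ALTER TABLE huge_table DROP CONSTRAINT huge_table_pkey;
    ALTER TABLE huge_table ALTER COLUMN value SET NOT NULL;
    ALTER TABLE huge_table ADD PRIMARY KEY USING INDEX huge_table_value_idx;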

Are all indexes present at the time you insert?  It will probably be much
faster to insert without the gin index (at least) and build it after the
load.

Without knowing this key fact, it is hard to interpret the rest of your
data.


>
> I managed to load the table initially in about 9 hours, after doing some
> optimizations below based on various documentation (the server is 8-core
> Xeon
> E5504, 16 GB RAM, 4 Hitachi 1TB 7200 RPM in a RAID 5 via Linux MD):
>  ...



>
>
* maintenance_work_mem 512 MB
>

Building a gin index in bulk could benefit from more memory here.
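
For example, combining that with loading before indexing (a sketch; the 4GB
figure is an assumption to be sized against your 16GB of RAM):

    DROP INDEX huge_table_value_trgm;
    -- ... bulk load with COPY here ...
    SET maintenance_work_mem = '4GB';  -- session-level, just for this build
    CREATE INDEX huge_table_value_trgm ON huge_table
        USING gin (value gin_trgm_ops);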

* synchronous_commit off
>

If you are already using unlogged tables, this might not be so helpful,
but it does increase the risk to the rest of your system.



>   PID USER  PR  NIVIRTRESSHR S  %CPU %MEM TIME+ COMMAND
> 29578 postgres  20   0 6575672 6.149g 6.139g R  86.0 39.7  45:24.97
> postgres
>

You should expand the command line (by hitting 'c', at least in my version
of top) so we can see which postgres process this is.


>
> As for queries, doing a simple query like this one seems to require around
> 30
> seconds to a minute. My volume is not crazy high but I am hoping I could
> get
> this down to less than 30 seconds, because other stuff above this code will
> start to time out otherwise:
>
> osint=# explain analyze select * from huge_table where value ilike
> '%keyword%';
>

explain (analyze, buffers), please.  And hopefully with track_io_timing=on.
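
That is (setting track_io_timing in a session may require superuser):

    SET track_io_timing = on;
    EXPLAIN (ANALYZE, BUFFERS)
        SELECT * FROM huge_table WHERE value ILIKE '%keyword%';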

If you repeat the same query, is it then faster, or is it still slow?

Cheers,

Jeff


Re: insert and query performance on big string table with pg_trgm

2017-11-24 Thread Jeff Janes
On Nov 21, 2017 00:05, "Matthew Hall"  wrote:


> Are all indexes present at the time you insert?  It will probably be much
faster to insert without the gin index (at least) and build it after the
load.

There is some flexibility on the initial load, but the updates in the
future will require the de-duplication capability. I'm willing to accept
that might be somewhat slower on the load process, to get the accurate
updates, provided we could try meeting the read-side goal I wrote about, or
at least figure out why it's impossible, so I can understand what I need to
fix to make it possible.


As long as you don't let anyone use the table between the initial load and
when the index build finishes, you don't have to compromise on
correctness.  But yeah, makes sense to worry about query speed first.






> If you repeat the same query, is it then faster, or is it still slow?

If you keep the expression exactly the same, it still takes a few seconds
as could be expected for such a torture test query, but it's still WAY
faster than the first such query. If you change it out to a different
expression, it's longer again of course. There does seem to be a
low-to-medium correlation between the number of rows found and the query
completion time.


To make this quick, you will need to get most of the table and most of the
index cached into RAM.  A good way to do that is with pg_prewarm.  Of
course that only works if you have enough RAM in the first place.

What is the size of the table and the gin index?
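
Something along these lines would answer the size question and let you try
the prewarm idea (pg_prewarm returns the number of blocks read):

    SELECT pg_size_pretty(pg_table_size('huge_table')) AS table_size,
           pg_size_pretty(pg_relation_size('huge_table_value_trgm')) AS gin_size;

    CREATE EXTENSION IF NOT EXISTS pg_prewarm;
    SELECT pg_prewarm('huge_table');
    SELECT pg_prewarm('huge_table_value_trgm');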


Cheers,

Jeff


Re: Setting effective_io_concurrency in VM?

2017-11-27 Thread Jeff Janes
On Mon, Nov 27, 2017 at 10:40 AM, Scott Marlowe 
wrote:

>
> Generally VMs are never going to be as fast as running on bare metal
> etc. You can adjust it and test it with something simple like pgbench
> with various settings for -c (concurrency) and see where it peaks etc
> with the setting. This will at least get you into the ball park.
>

None of the built-in workloads for pgbench cares a whit about
effective_io_concurrency.  He would have to come up with some custom
transactions to exercise that feature.  (Or use the tool people use to run
the TPC-H benchmark, rather than pgbench's built-in transactions.)
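
A sketch of such a custom transaction, against a hypothetical
randomly-ordered copy of pgbench_accounts so that the bitmap heap scan has
scattered pages to prefetch:

    -- setup (once):
    --   CREATE TABLE accounts_random AS
    --     SELECT * FROM pgbench_accounts ORDER BY random();
    --   CREATE INDEX ON accounts_random (bid);
    -- eic_test.sql; run as: pgbench -n -f eic_test.sql -T 60
    \set bid random(1, 100)
    SELECT sum(abalance) FROM accounts_random WHERE bid = :bid;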

I think the best overall advice would be to configure it the same as you
would if it were not a VM.  There may be cases where you diverge from that,
but I think each one would require extensive investigation and
experimentation, so can't be turned into a rule of thumb.

Cheers,

Jeff


Re: Bitmap scan is undercosted?

2017-12-01 Thread Jeff Janes
On Fri, Dec 1, 2017 at 3:54 PM, Vitaliy Garnashevich <
vgarnashev...@gmail.com> wrote:

> On 02/12/2017 01:11, Justin Pryzby wrote:
>
>> I tried to reproduce this issue and couldn't, under PG95 and 10.1:
>>
>> On Fri, Dec 01, 2017 at 12:34:27PM -0600, Justin Pryzby wrote:
>>
>>> On Fri, Dec 01, 2017 at 07:40:08PM +0200, Vitaliy Garnashevich wrote:
>>>
 We recently had an issue in production, where a bitmap scan was chosen
 instead of an index scan. Despite being 30x slower, the bitmap scan had
 about the same cost as the index scan.
 drop table if exists aaa;
 create table aaa as select (id%100)::int num, (id%10=1)::bool flag from
 generate_series(1, 1000) id;
 create index i1 on aaa  (num);
 create index i2 on aaa  (flag);
 analyze aaa;

>>> What is:
>> effective_io_concurrency
>> max_parallel_workers_per_gather (I gather you don't have this)
>>
> effective_io_concurrency = 0
> max_parallel_workers_per_gather = 0
>
> Did you notice random_page_cost = 1.5 ?
>

For the aaa.num = 39 case, the faster index scan actually does hit 15 times
more buffers than the bitmap scan does.  While 1.5 is a lot lower than 4.0,
it is still much higher than the true cost of reading a page from the
buffer cache.  This is why the index scan is getting punished.  You could
lower random_page_cost and seq_page_cost to 0 to remove those
considerations.  (I'm not saying you should do this on your production
system, but rather that you should do it as a way to investigate the
issue.  But it might make sense on production as well.)


> For this test I'm using SSD and Windows (if that matters). On production
> we also use SSD, hence lower random_page_cost. But with the default
> random_page_cost=4.0, the difference in cost between the index scan plan
> and the bitmap scan plan is even bigger.


Since it is all shared buffers hits, it doesn't matter if you have SSD for
this particular test case.

Cheers,

Jeff


Re: Bitmap scan is undercosted?

2017-12-02 Thread Jeff Janes
On Fri, Dec 1, 2017 at 11:08 PM, Vitaliy Garnashevich <
vgarnashev...@gmail.com> wrote:

>
>
> seq_page_cost = 0.0
> random_page_cost = 0.0
> explain analyze select * from aaa where num = 2 and flag = true;
>
> Bitmap Heap Scan on aaa  (cost=753.00..2003.00 rows=10257 width=5) (actual
> time=82.212..82.212 rows=0 loops=1)
>   ->  Bitmap Index Scan on i1  (cost=0.00..750.43 rows=10 width=0)
> (actual time=17.401..17.401 rows=10 loops=1)
>
> Index Scan using i1 on aaa  (cost=0.44..1750.43 rows=10257 width=5)
> (actual time=49.766..49.766 rows=0 loops=1)
>
> The bitmap plan was reduced to use only one bitmap scan, and finally it
> costs more than the index plan.
>

Right, so there is a CPU costing problem (which could only be fixed by
hacking PostgreSQL and recompiling it), but it is a much smaller problem
than the IO cost not being accurate due to the high hit rate.  Fixing the
CPU costing problem is unlikely to make a difference to your real query.
If you set the page costs to zero, what happens to your real query?


> But I doubt that the settings seq_page_cost = random_page_cost = 0.0
> should actually be used.
>

Why not?  If your production server really has everything in memory during
normal operation, that is the correct course of action.  If you ever
restart the server, then you could have some unpleasant time getting it
back up to speed again, but pg_prewarm could help with that.


> Probably it should be instead something like 1.0/1.0 or 1.0/1.1, but other
> costs increased, to have more weight.
>

This doesn't make any sense to me.  Halving the page costs is
mathematically the same as doubling all the other constants.  But the first
way of doing things says what you are doing, and the second way is an
obfuscation of what you are doing.


>
> # x4 tuple/operator costs - bitmap scan still a bit cheaper
> set seq_page_cost = 1.0;
> set random_page_cost = 1.0;
> set cpu_tuple_cost = 0.04;
> set cpu_index_tuple_cost = 0.02;
> set cpu_operator_cost = 0.01;
>

If you really want to target the plan with the BitmapAnd, you should
increase cpu_index_tuple_cost and/or cpu_operator_cost, but not
cpu_tuple_cost.  That is because the unselective bitmap index scan does
not incur any cpu_tuple_cost, but does incur index_tuple and operator
costs.  Unfortunately all other index scans in the system will also be
skewed by such a change if you make the change system-wide.

Incidentally, the "actual rows" field of BitmapAnd is always zero.  That
field is not implemented for that node type.


Why do you have an index on flag in the first place?  What does the index
accomplish, other than enticing the planner into bad plans?  I don't know
how this translates back into your real query, but dropping that index
should be considered.  Or replace both indexes with one on (num,flag).

Or you can re-write the part of the WHERE clause in a way that it can't use
an index, something like:

and flag::text = 't'

Cheers,

Jeff


Re: Bitmap scan is undercosted?

2017-12-02 Thread Jeff Janes
On Sat, Dec 2, 2017 at 3:44 PM, Tom Lane  wrote:

> Jeff Janes  writes:
> > On Fri, Dec 1, 2017 at 11:08 PM, Vitaliy Garnashevich <
> > vgarnashev...@gmail.com> wrote:
> >> # x4 tuple/operator costs - bitmap scan still a bit cheaper
> >> set seq_page_cost = 1.0;
> >> set random_page_cost = 1.0;
> >> set cpu_tuple_cost = 0.04;
> >> set cpu_index_tuple_cost = 0.02;
> >> set cpu_operator_cost = 0.01;
>
> > If you really want to target the plan with the BitmapAnd, you should
> > increase  cpu_index_tuple_cost and/or cpu_operator_cost but not increase
> > cpu_tuple_cost.  That is because the  unselective bitmap index scan does
> > not incur any cpu_tuple_cost, but does incur index_tuple and operator
> > costs.  Unfortunately all other index scans in the system will also be
> > skewed by such a change if you make the change system-wide.
>
> I think it'd be a serious error to screw around with your cost settings
> on the basis of a single case in which the rowcount estimates are so
> far off.  It's really those estimates that are the problem AFAICS.
>
> The core issue in this example is that, the way the test data is set up,
> the "flag = true" condition actually adds no selectivity at all, because
> every row with "num = 1" is certain to have "flag = true".  If the planner
> realized that, it would certainly not bother with BitmapAnd'ing the flag
> index onto the results of the num index.  But it doesn't know that those
> columns are correlated, so it supposes that adding the extra index will
> give a 10x reduction in the number of heap rows that have to be visited
> (since it knows that only 1/10th of the rows have "flag = true").
> *That* is what causes the overly optimistic cost estimate for the
> two-index bitmapscan, and no amount of fiddling with the cost parameters
> will make that better.
>


But he also tested with num=2 and num=39, which reverses the situation so
the bitmap is 100% selective rather than the 90% the planner thinks it will
be.

But it is still slower for him (I am having trouble replicating that exact
behavior), so building the bitmap to rule out 100% of the rows is
empirically not worth it.  I don't see how building it to rule out 90%, as
the planner thinks, would be any better.


> I tried creating multiple-column statistics using the v10 facility for
> that:
>
> regression=# create statistics s1 on num, flag from aaa;
> CREATE STATISTICS
> regression=# analyze aaa;
> ANALYZE
>
> but that changed the estimate not at all, which surprised me because
> dependency statistics are supposed to fix exactly this type of problem.
> I suspect there may be something in the extended-stats code that causes it
> not to work right for boolean columns --- this wouldn't be excessively
> surprising because of the way the planner messes around with converting
> "flag = true" to just "flag" and sometimes back again.  But I've not
> looked closer yet.
>

I think the non-extended stats code also has trouble with booleans.
pg_stats gives me a correlation of 0.8 or higher for the flag column.

Due to that, when I disable bitmapscans and seqscans, I start getting slow
index scans on the wrong index, i2 rather than i1.  I don't know why he
doesn't see that in his example.

Cheers,

Jeff


Re: Bitmap scan is undercosted? - boolean correlation

2017-12-03 Thread Jeff Janes
On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby  wrote:

> On Sat, Dec 02, 2017 at 05:27:51PM -0800, Jeff Janes wrote:
> > I think the non-extended stats code also has trouble with booleans.
> > pg_stats gives me a correlation  of 0.8 or higher for the flag column.
>
> It's not due to the boolean though; you see the same thing if you do:
> CREATE INDEX aaa_f ON aaa((flag::text));
> ANALYZE aaa;
> correlation | 0.81193
>
> or:
> ALTER TABLE aaa ADD flag2 int; UPDATE aaa SET flag2= flag::int
> correlation | 0.81193
>
> I think it's caused by having so few (2) values to correlate.
>
> most_common_vals   | {f,t}
> most_common_freqs  | {0.9014,0.0986}
> correlation| 0.822792
>
> It thinks there's somewhat-high correlation since it gets a list of x and y
> values (integer positions by logical and physical sort order) and 90% of
> the x
> list (logical value) are the same value ('t'), and the CTIDs are in order
> on
> the new index, so 90% of the values are 100% correlated.
>

But there is no index involved (except in the case of the functional
index).  The correlation of table columns to physical order of the table
doesn't depend on the existence of an index, or the physical order within
an index.

But I do see that ties within the logical order of the column values are
broken to agree with the physical order.  That is wrong, right?  Is there
any argument that this is desirable?

It looks like it could be fixed with a few extra double calcs per distinct
value.  Considering we already sorted the sample values using SQL-callable,
collation-dependent comparators, I doubt a few C-level double calcs are
going to be meaningful.

Cheers,

Jeff


Re: Bitmap scan is undercosted? - boolean correlation

2017-12-03 Thread Jeff Janes
On Dec 3, 2017 15:31, "Tom Lane"  wrote:

Jeff Janes  writes:
> On Sat, Dec 2, 2017 at 8:04 PM, Justin Pryzby 
wrote:
>> It thinks there's somewhat-high correlation since it gets a list of x
>> and y values (integer positions by logical and physical sort order) and
>> 90% of the x list (logical value) are the same value ('t'), and the
>> CTIDs are in order on the new index, so 90% of the values are 100%
>> correlated.

> But there is no index involved (except in the case of the functional
> index).  The correlation of table columns to physical order of the table
> doesn't depend on the existence of an index, or the physical order within
> an index.

> But I do see that ties within the logical order of the column values are
> broken to agree with the physical order.  That is wrong, right?  Is there
> any argument that this is desirable?

Uh ... what do you propose doing instead?  We'd have to do something with
ties, and it's not so obvious this way is wrong.


Let them be tied.  If there are 10 distinct values, number the values 0 to
9, and all rows of a given distinct value get the same number for the
logical order axis.

Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
me.  Although if we switched btree to store duplicate values with tid as a
tie breaker, then maybe it wouldn't be as obviously wrong.
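
A quick way to see the discrepancy on the aaa test table from upthread (not
exactly the statistic ANALYZE computes, but close enough to show the flag
values are essentially uncorrelated with physical position):

    SELECT corr(flag::int::float8, phys::float8)  -- comes out near 0, not 0.8
    FROM (SELECT flag, row_number() OVER () AS phys FROM aaa) s;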

Cheers,

Jeff


Re: Bitmap scan is undercosted?

2017-12-06 Thread Jeff Janes
On Sun, Dec 3, 2017 at 1:15 PM, Vitaliy Garnashevich <
vgarnashev...@gmail.com> wrote:

> On 02/12/2017 23:17, Jeff Janes wrote:
>
> Right, so there is a cpu costing problem (which could only be fixed by
> hacking postgresql and recompiling it), but it is much smaller of a problem
> than the IO cost not being accurate due to the high hit rate.  Fixing the
> CPU costing problem is unlikely to make a difference to your real query.
> If you set the page costs to zero, what happens to your real query?
>
> I can't reproduce the exact issue on the real database any more. The query
> started to use the slow bitmap scan recently, and had been doing so for
> some time lately, but now it's switched back to use the index scan. The
> table involved in the query gets modified a lot. It has hundreds of
> millions of rows. Lots of new rows are appended to it every day, the oldest
> rows are sometimes removed. The table is analyzed at least daily. It's
> possible that statistics was updated and that caused the query to run
> differently. But I still would like to understand why that issue happened,
> and how to properly fix it, in case the issue returns.
>

While your test case displays some cost estimation issues, there is really
no reason to think that they are the same issues your real query shows.
Particularly since you said the difference was a factor of 30 in the real
case, rather than 3.  Any chance you can show EXPLAIN ANALYZE output for
the real query, but when it is acting up and when it is not?  Something in
the plans might stand out to us as the obvious problem.  On the other hand,
maybe nothing will stand out without having a replicable test case.  The
only way to know is to try.


>
>
>
>> But I doubt that the settings seq_page_cost = random_page_cost = 0.0
>> should actually be used.
>>
>
> Why not?  If your production server really has everything in memory during
> normal operation, that is the correct course of action.  If you ever
> restart the server, then you could have some unpleasant time getting it
> back up to speed again, but pg_prewarm could help with that.
>
> In the real database, not everything is in memory. There are 200GB+ of
> RAM, but DB is 500GB+. The table involved in the query itself is 60GB+ of
> data and 100GB+ of indexes. I'm running the test case in a way where all
> reads are done from RAM, only to make it easier to reproduce and to avoid
> unrelated effects.
>

Is everything that the particular query in questions needs in memory, even
if other queries need things from disk?  Or does the problematic query also
need things from disk?  If the query does need to read things from disk,
the bitmap actually should be faster.  Which reinforces the idea that maybe
the issue brought up by your test case is not the same as the issue brought
up by your real case, even if they both point in the same direction.


> As far as know, costs in Postgres were designed to be relative to
> seq_page_cost, which for that reason is usually defined as 1.0. Even if
> everything would be in RAM, accesses to the pages would still not have zero
> cost. Setting 0.0 just seems too extreme, as all other non-zero costs would
> become infinitely bigger.
>

When exploring things, 0.0 certainly helps to simplify them.  Yes, 0.05 or
something similar might be better for a completely cached database.  The
problem is that it is very context dependent.  Reading a page from
shared_buffers when there is no contention from other processes for the
same page is probably less than 0.01.  If it is not in shared_buffers but
is in effective_cache_size, it is probably a few multiples of 0.01.  If
there is contention either for that specific page, or for available buffers
into which to read pages, then it could be substantially higher yet.
However, none of those are things the planner is aware of.

> If you really want to target the plan with the BitmapAnd, you should
> increase  cpu_index_tuple_cost and/or cpu_operator_cost but not increase
> cpu_tuple_cost.  That is because the  unselective bitmap index scan does
> not incur any cpu_tuple_cost, but does incur index_tuple and operator
> costs.  Unfortunately all other index scans in the system will also be
> skewed by such a change if you make the change system-wide.
>
> Exactly. I'd like to understand why the worse plan is being chosen, and 1)
> if it's fixable by tuning costs, to figure out the right settings which
> could be used in production, 2) if there is a bug in Postgres optimizer,
> then to bring some attention to it, so that it's eventually fixed in one of
> future releases, 3) if Postgres is supposed to work this way, then at least
> I (and people who ever read this thread) would understand it better.
>

I would argue that it is a planner "bug", (quo

Re: Bitmap scan is undercosted? - boolean correlation

2017-12-06 Thread Jeff Janes
On Tue, Dec 5, 2017 at 10:50 AM, Tom Lane  wrote:

> Jeff Janes  writes:
> > On Dec 3, 2017 15:31, "Tom Lane"  wrote:
> >> Jeff Janes  writes:
> >>> But I do see that ties within the logical order of the column values
> are
> >>> broken to agree with the physical order.  That is wrong, right?  Is
> there
> >>> any argument that this is desirable?
>
> >> Uh ... what do you propose doing instead?  We'd have to do something
> with
> >> ties, and it's not so obvious this way is wrong.
>
> > Let them be tied.  If there are 10 distinct values, number the values 0
> to
> > 9, and all rows of a given distinct values get the same number for the
> > logical order axis.
> > Calling the correlation 0.8 when it is really 0.0 seems obviously wrong
> to
> > me.  Although if we switched btree to store duplicate values with tid as
> a
> > tie breaker, then maybe it wouldn't be as obviously wrong.
>
> I thought some more about this.  What we really want the correlation stat
> to do is help us estimate how randomly an index-ordered scan will access
> the heap.


The correlation is used in another place, estimating how much of the table
we will visit in the first place.  If the correlation is very high, then
scanning 10% of the index leaf pages means we will visit 10% of the table.
If the correlation is low, then we use Mackert and Lohman, and (in the case
of visiting 10% of the index) predict we will visit most of the table.
Assuming effective_cache_size is high, we will visit most of the table just
once, but still in a random order, because subsequent visits for the same
query will be found in the cache, rather than the various pages being
visited repeatedly and missed each time.

In addition to estimating how much of the table we visit, we also estimate
how "sequential like" those visits are.  Which is the use that you
describe.  Ideally for that use case, we would know for each distinct
value, how correlated the tids are with the leaf page ordering.  If the
index is freshly built, that is very high.  We visit 1/10 of the index,
which causes us to visit 100% of the table but in perfect order, plucking
1/10 of the tuples from each table page.

But visiting 100% of the table in physical order in order to pluck out 10%
of the tuples from each page is quite different than visiting 10% of the
table pages in physical order to pluck out 100% of the tuples from those
pages and 0% from the pages not visited.

...

BTW, I disagree that "correlation = zero" is the right answer for this
> particular example.  If the btree is freshly built, then an index-order
> scan would visit all the heap pages in sequence to fetch "f" rows, and
> then visit them all in sequence again to fetch "t" rows, which is a whole
> lot better than the purely random access that zero correlation implies.
> So I think 0.8 or so is actually a perfectly reasonable answer when the
> index is fresh.  The trouble is just that it'd behoove us to derate that
> answer somewhat for the probability that the index isn't fresh.
>

But, for the case of "how much of the table do we visit at all",
correlation = zero is the right answer, even if it isn't the right answer
for "how sequentially do we visit whatever we visit".



> My first thought for a concrete fix was to use the mean position of
> a group of duplicates for the purposes of the correlation calculation,
> but on reflection that's clearly wrong.  We do have an idea, from the
> data we have, whether the duplicates are close together in the heap
> or spread all over.  Using only mean position would fail to distinguish
> those cases, but really we'd better penalize the spread-all-over case.
> I'm not sure how to do that.
>

Departing from correlations, we could also try to estimate "how many
different table pages does each index leaf page reference?"  This could
capture functional dependencies which are strong, but not in the form of
linear correlations.  (The current extended statistics only capture
dependencies between user columns, not between one user column and one
system column such as the table slot.)

For whatever it's worth, here is my "let ties be ties" patch.

It breaks two regression tests due to plan changes, and both are cases
where maybe the plan ought to change for the very reason being discussed.
If I just put random gibberish into the correlation field, more regression
tests fail, so I think my implementation is not too far broken.

The accumulations into corr_ysum and corr_y2sum could trivially be pushed
down into the "if", and corr_xysum could as well with a little algebra.
But that seems like premature optimization for a proof-of-concept patch.


Cheers,


Re: Bitmap scan is undercosted?

2017-12-11 Thread Jeff Janes
On Tue, Dec 5, 2017 at 11:06 PM, Vitaliy Garnashevich <
vgarnashev...@gmail.com> wrote:


This is very cool, thanks.


> I've tried to create a better test case:
> - Increase shared_buffers and effective_cache_size to fit whole database,
> including indexes.
> - Use random(), to avoid correlation between the filtered values.
> - Make both columns of integer type, to avoid special cases with boolean
> (like the one happened with CREATE STATISTICS).
> - Flush OS disk cache and then try running the query several times, to get
> both cold-cache results and all-in-ram results.
> - There are several tests, with different frequency of the selected values
> in the two columns: [1/10, 1/10], [1/50, 1/10], [1/100, 1/10].
> - There is a split of cost by contribution of each of its components:
> seq_page_cost, random_page_cost, cpu_tuple_cost, cpu_index_tuple_cost,
> cpu_operator_cost. The EXPLAIN is run for each component, every time with
> only one of the components set to non-zero.
>

Where you have question marks, that means you could not force it into the
plan you wanted with all-but-one settings being zero?



> - The test was run on a Digitalocean VM: Ubuntu 16.04.3 LTS (GNU/Linux
> 4.4.0-101-generic x86_64), 2 GB RAM,  2 core CPU, SSD; PostgreSQL 9.5.10.
>
>
> shared_buffers = 1024MB
> effective_cache_size = 1024MB
>

I would set this even higher.



>
> work_mem = 100MB
>
> create table aaa as select floor(random()*10)::int num, (random()*10 <
> 1)::int flag from generate_series(1, 1000) id;
> create table aaa as select floor(random()*50)::int num, (random()*10 <
> 1)::int flag from generate_series(1, 1000) id;
> create table aaa as select floor(random()*100)::int num, (random()*10 <
> 1)::int flag from generate_series(1, 1000) id;
>
> create index i1 on aaa  (num);
> create index i2 on aaa  (flag);
>
> set enable_bitmapscan = on; set enable_indexscan = off;  set
> enable_seqscan = off;
> set enable_bitmapscan = off; set enable_indexscan = on;  set
> enable_seqscan = off;
> set enable_bitmapscan = off; set enable_indexscan = off;  set
> enable_seqscan = on;
>
> set seq_page_cost = 1.0; set random_page_cost = 1.0; set cpu_tuple_cost =
> 0.01; set cpu_index_tuple_cost = 0.005; set cpu_operator_cost = 0.0025;
>
> explain (analyze,verbose,costs,buffers) select * from aaa where num = 1
> and flag = 1;
>

One thing to try is to use explain (analyze, timing off), and then get the
total execution time from the summary line at the end of the explain,
rather than from the "actual time" fields.  Collecting the times of each
individual step in the execution can impose a lot of overhead, and some
plans have more of this artificial overhead than others.  It might change
the results, or it might not.  I know that sorts and hash joins are very
sensitive to this, but I don't know about bitmap scans.
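
That is, something along these lines, comparing the "Execution time" summary
lines between the plans:

    EXPLAIN (ANALYZE, TIMING OFF, BUFFERS)
        SELECT * FROM aaa WHERE num = 1 AND flag = 1;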



What seems odd to me is that in different kinds of tests (with different
> frequency of column values):
>
> i1 Rows Removed by Filter = 900156, 179792, 89762 (decreased a lot)
> i1 buffers = 46983, 44373, 39928 (decreased, but not a lot)
>

To filter out 89762 tuples, you first have to look them up in the table,
and since they are randomly scattered that means you hit nearly every page
in the table at least once.  In fact, I don't understand how the empirical
number of buffer hits can be only 46983 in the first case, when it has to
visit 1,000,000 rows (and reject 90% of them).  I'm guessing that it is
because your newly created index is sorted by ctid order within a given
index value, and that the scan holds a pin on the table page between
visits, and so doesn't count it as a hit if it already holds the pin.

You could try creating an empty table, creating the indexes, and then
populating the table with your random select query, to see if that changes
the buffer hit count.  (Note that this wouldn't change the cost estimates
much even if it does change the measured number of buffer hits, because of
effective_cache_size.  It knows you will be hitting ~47,000 pages ~25 times
each, and so only charges you for the first time each one is hit.)
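
A sketch of that experiment (table and index names are mine; definitions
follow the ones above):

    CREATE TABLE aaa2 (num int, flag int);
    CREATE INDEX i1b ON aaa2 (num);
    CREATE INDEX i2b ON aaa2 (flag);
    INSERT INTO aaa2
        SELECT floor(random()*10)::int, (random()*10 < 1)::int
        FROM generate_series(1, 1000);
    ANALYZE aaa2;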


> i1 best case time = 756.045, 127.814, 79.492 (decreased a lot, as well as
> probably average case too)
> i1 cost estimates = 67358.15, 48809.34, 46544.80 (did not decrease a lot)
>

Right.  Your best case times are when the data is completely in cache.  But
your cost estimates are dominated by *_page_cost, which are irrelevant when
the data is entirely in cache.  You are telling it to estimate the
worst-case costs, and it is doing that pretty well (within this one plan).


>
> i2 Rows Removed by Filter = 900835, 980350, 991389
> i2 buffers = 46985, 46985, 46987
> i2 best case time = 377.554, 346.481, 387.874
> i2 cost estimates = 39166.34, 39247.83, 40167.34
>
> It's odd that increase in actual execution time for "i1" was not reflected
> enough in cost estimates.
>

No, that's entirely expected given your settings.  As long as you are
charging disk-read costs for reading data from RAM, you

Re: Bitmap scan is undercosted? - overestimated correlation and cost_index

2017-12-12 Thread Jeff Janes
On Wed, Dec 6, 2017 at 1:46 PM, Justin Pryzby  wrote:

> On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:
> > Jeff Janes  writes:
> > > On Dec 3, 2017 15:31, "Tom Lane"  wrote:
> > >> Jeff Janes  writes:
> > >>> But I do see that ties within the logical order of the column values
> are
> > >>> broken to agree with the physical order.  That is wrong, right?  Is
> there
> > >>> any argument that this is desirable?
> >
> > >> Uh ... what do you propose doing instead?  We'd have to do something
> with
> > >> ties, and it's not so obvious this way is wrong.
> >
> > > Let them be tied.
> ...
> > I thought some more about this.  What we really want the correlation stat
> > to do is help us estimate how randomly an index-ordered scan will access
> > the heap.  If the values we've sampled are all unequal then there's no
> > particular issue.  However, if we have some group of equal values, we
> > do not really know what order an indexscan will visit them in.  The
> > existing correlation calculation is making the *most optimistic possible*
> > assumption, that such a group will be visited exactly in heap order ---
> > and that assumption isn't too defensible.
>
> I'm interested in discusstion regarding bitmap cost, since it would have
> helped
> our case discussed here ~18 months ago:
> https://www.postgresql.org/message-id/flat/20160524173914.GA11880%
> 40telsasoft.com#20160524173914.ga11...@telsasoft.com
>
> ...but remember: in Vitaliy's case (as opposed to mine), the index scan is
> *faster* but being estimated at higher cost than bitmap (I have to keep
> reminding myself).  So the rest of this discussion is about the
> overly-optimistic cost estimate of index scans, which moves in the opposite
> direction for this reported problem.  For the test cases I looked at, index
> scans were used when RPC=1 and redundant conditions were avoided, so I'm
> not
> sure if there's any remaining issue (but I haven't looked at the latest
> cases
> Vitaliy sent).
>
> > In any case, given that we do this calculation without regard
> > to any specific index,
>
> One solution is to compute stats (at least correlation) for all indices,
> not
> just expr inds.  I did that earlier this year while throwing around/out
> ideas.
> https://www.postgresql.org/message-id/20170707234119.
> GN17566%40telsasoft.com


When is the correlation of a column which is not the leading column of a
btree index or in a brin index ever used?  If we did compute index-specific
correlations, we could maybe just drop pure-column correlations.


>
> > We do have an idea, from the data we have, whether the duplicates are
> close
> > together in the heap or spread all over.
>
> I think you just mean pg_stats.correlation for all values, not just
> duplicates
> (with the understanding that duplicates might be a large fraction of the
> tuples, and high weight in correlation).
>
> Another issue I noted in an earlier thread is that as table sizes
> increase, the
> existing correlation computation approaches 1 for correlated insertions,
> (like
> "append-only" timestamps clustered around now()), due to ANALYZE sampling a
> fraction of the table, and thereby representing only large-scale
> correlation,
>

That isn't due to sampling.  That is due to the definition of linear
correlation.  Large scale is what it is about.


> Generated data demonstrating this (I reused this query so it's more
> complicated
> than it needs to be):
>
> [pryzbyj@database ~]$ time for sz in {,9{,9{,9{,9}}}} ; do psql
> postgres -tc "DROP TABLE IF EXISTS t; CREATE TABLE t(i float, j int);
> CREATE INDEX ON t(i); INSERT INTO t SELECT i/9.0+pow(2,(-random())) FROM
> generate_series(1,$sz) i ORDER BY i; ANALYZE t; SELECT $sz, correlation,
> most_common_freqs[1] FROM pg_stats WHERE attname='i' AND tablename='t'";
> done
>
>   |0.187146 |
> 9 |0.900629 |
>99 |0.998772 |
>   999 |0.87 |
>

Because the amount of jitter introduced is constant WRT $sz, but the range
of i/9.0 increases with $sz, the correlation actually does increase; it
is not a sampling effect.

> Trying to keep it all in my own head: For sufficiently large number of
> pages,
> bitmap scan should be preferred to idx scan due to reduced random-page-cost
> outweighing its overhead in CPU cost.


But CPU cost is probably not why it is losing anyway.

Index scans get a double bonus from high correlation.  The planner assumes
that only a small fraction of the table will be visited.  And then it as

Re: Batch insert heavily affecting query performance.

2017-12-27 Thread Jeff Janes
On Sun, Dec 24, 2017 at 11:51 AM, Jean Baro  wrote:

> Hi there,
>
> We are testing a new application to try to find performance issues.
>
> AWS RDS m4.large 500GB storage (SSD)
>

Is that general purpose SSD, or provisioned IOPS SSD?  If provisioned, what
is the level of provisioning?

Cheers,

Jeff


Re: primary key hash index

2018-01-04 Thread Jeff Janes
On Tue, Jan 2, 2018 at 6:02 AM, Rick Otten  wrote:

> After reading this article about keys in relational databases, highlighted
> on hacker news this morning:
> https://begriffs.com/posts/2018-01-01-sql-keys-in-depth.html
>
> I keep pondering the performance chart, regarding uuid insert, shown
> towards the bottom of the article.  I believe he was doing that test with
> PostgreSQL.
>
> My understanding is that the performance is degrading because he has a
> btree primary key index.  Is it possible to try a hash index or some other
> index type for a uuid primary key that would mitigate the performance issue
> he is recording?
>

Hash indexes do not yet support primary keys, but you could always test it
with just a plain index, since you already know the keys are unique via
the way they are constructed.  But I wouldn't expect any real improvement.
Hash indexes still trigger FPW and still dirty massive numbers of pages in
a random fashion (even worse than btree does as far as randomness goes,
but since the hash is more compact maybe more of the pages will be
re-dirtied and so save on FPW or separate writes).  I was surprised that
turning off FPW was so effective for him; that suggests that maybe his
checkpoints are too close together, which I guess means max_wal_size is
too low.
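
A minimal way to test the plain-hash-index idea (a sketch; gen_random_uuid()
comes from pgcrypto on servers before v13):

    CREATE EXTENSION IF NOT EXISTS pgcrypto;
    CREATE TABLE uuid_test (id uuid);
    CREATE INDEX uuid_test_hash ON uuid_test USING hash (id);
    INSERT INTO uuid_test
        SELECT gen_random_uuid() FROM generate_series(1, 1000000);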

Cheers,

Jeff


Re: Need Help on wal_compression

2018-01-09 Thread Jeff Janes
On Tue, Jan 9, 2018 at 1:53 AM, Rambabu V  wrote:

> Hi Team,
>
> Daily, 4000 archive files are generated and these are occupying more
> space. We are trying to compress WAL files using the wal_compression
> parameter, but we are not seeing any change in the WAL file count. Could
> you please help us with this?
>

If the number of files is driven by archive_timeout, then no reduction in
the number of them would be expected by turning on wal_compression.

If the number of files is driven by the 16MB limit on each file, then it is
surprising that wal_compression did not change it. (But the difference
might not be all that large, depending on the type of transactions and data
you are working with.)

I use an external compression program, xz, which compresses very well.  But
it is slow and has trouble keeping up at times of peak activity (e.g. bulk
loads or updates, or reindexing).  It reduces the aggregate size, but not
the number of files.
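
For reference, that kind of external compression is just an archive_command;
a sketch, with an assumed archive directory:

    # postgresql.conf
    archive_mode = on
    archive_command = 'test ! -f /mnt/wal_archive/%f.xz && xz -c %p > /mnt/wal_archive/%f.xz'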

Cheers,

Jeff


Re: Query is slow when run for first time; subsequent execution is fast

2018-01-10 Thread Jeff Janes
On Wed, Jan 10, 2018 at 3:59 AM, Nandakumar M  wrote:

>
> I am not using prepared statements. Postgres documentation and previous
> questions in the pgsql-performance mailing list mention that the query plan
> is cached only when prepared statements are used.
>
> https://www.postgresql.org/message-id/15600.1346885470%40sss.pgh.pa.us
>
> In the above thread Tom Lane mentions that the plan is never cached for
> raw queries. Yet, this is exactly what seems to be happening in my case. Am
> I missing something?
>

The query plan itself is not cached, but all the metadata about the (large
number of) tables used in the query is cached.  Apparently reading/parsing
that data is the slow step, not coming up with the actual plan.

> Please let me know how I can make sure the query execution for the first
> time is fast too.

Don't keep closing and reopening connections.  Use a connection pooler
(pgbouncer, pgpool, whatever pooler is built into your
language/library/driver, etc.) if necessary to accomplish this.

Cheers,

Jeff


Re: Query is slow when run for first time; subsequent execution is fast

2018-01-16 Thread Jeff Janes
On Fri, Jan 12, 2018 at 12:03 AM, Nandakumar M  wrote:

> Hello Jeff,
>
> Thanks for the insights.
>
> >Don't keep closing and reopening connections.
>
> Even if I close a connection and open a new one and execute the same
> query, the planning time is considerably less than the first time. Only
> when I restart the Postgres server then I face high planning time again.
>

Oh.  I've not seen that before.  But then again I don't often restart my
server and then immediately run very large queries with a stringent time
deadline.

You can try pg_prewarm on the pg_statistic table and its index.  But I'd
probably just put an entry in my db startup script to run this query
immediately after starting the server, and let the query warm the cache
itself.
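
A sketch of the pg_prewarm variant (the second name is pg_statistic's
unique index in current releases):

    CREATE EXTENSION IF NOT EXISTS pg_prewarm;
    SELECT pg_prewarm('pg_statistic');
    SELECT pg_prewarm('pg_statistic_relid_att_inh_index');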

Why do you restart your database often enough for this to be an issue?

Cheers,

Jeff


Re: need help on memory allocation

2018-01-23 Thread Jeff Janes
On Tue, Jan 23, 2018 at 5:59 AM, Rambabu V  wrote:

> > cat PostgreSQL-2018-01-23_06.csv|grep FATAL

What about ERROR, not just FATAL?  Or grep for "out of memory"



>> *$ free -mh*
>>  total   used   free sharedbuffers cached
>> Mem:   58G58G   358M16G   3.6M41G
>> -/+ buffers/cache:16G42G
>> Swap: 9.5G   687M   8.9G
>>
>
This does not seem like it should be a problem.  Is this data collected
near the time of the failure?


> work_mem = 256MB # min 64kB
>> max_connections = 600
>>
>
These look pretty high, especially in combination.  Why do you need that
number of connections?  Could you use a connection pooler instead?  Or do
just have an application bug (leaked connection handles) that needs to be
fixed?  Why do you need that amount of work_mem?
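
(Rough arithmetic, mine: each sort or hash node in a plan can use up to
work_mem, so 600 connections running even one 256MB operation apiece could
want ~150GB on a 58GB machine.)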


> *ps -ef|grep postgres|grep idle|wc -l*
>> 171
>>
>> *ps -ef|grep postgres|wc -l*
>> 206
>>
>
How close to the time of the problem was this recorded?  How many of the
idle are 'idle in transaction'?


>> PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
>>  109063 postgres   20   0 16.7G 16.4G 16.3G S  0.0 27.8 39:55.61
>> postgres: test sss 10.20.2.228(55174) idle
>>   24910 postgres   20   0 16.7G 16.4G 16.3G S  0.0 27.8 27:45.35
>> postgres: testl sss 10.20.2.228(55236) idle
>>  115539 postgres   20   0 16.7G 16.4G 16.3G S  0.0 27.8 28:22.89
>> postgres: test sss 10.20.2.228(55184) idle
>>9816 postgres   20   0 16.7G 16.4G 16.3G S  0.0 27.8 40:19.57
>> postgres: test sss   10.20.2.228(55216) idle
>>
>
How close to the time of the problem was this recorded?  Nothing here seems
to be a problem, because almost all the memory they have resident is shared
memory.

It looks like all your clients decide to run a memory hungry query
simultaneously, consume a lot of work_mem, and cause a problem.  Then by
the time you notice the problem and start collecting information, they are
done and things are back to normal.

Cheers,

Jeff


Re: effective_io_concurrency on EBS/gp2

2018-01-31 Thread Jeff Janes
On Wed, Jan 31, 2018 at 4:03 AM, Vitaliy Garnashevich <
vgarnashev...@gmail.com> wrote:

>
> The results look really confusing to me in two ways. The first one is that
> I've seen recommendations to set effective_io_concurrency=256 (or more) on
> EBS.


I would not expect this to make much of a difference on a table which is
perfectly correlated with the index.  You would have to create an accounts
table which is randomly ordered to have a meaningful benchmark of the eic
parameter.
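
A sketch of building such a table (names are mine):

    CREATE TABLE accounts_random AS
        SELECT * FROM pgbench_accounts ORDER BY random();
    CREATE INDEX accounts_random_bid ON accounts_random (bid);
    VACUUM ANALYZE accounts_random;

A bid-equality query against that copy has to fetch heap pages scattered all
over the table, which is what effective_io_concurrency prefetches during a
bitmap heap scan.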

I don't know why the default for eic is 1.  It seems like that just turns
on the eic mechanism, without any hope of benefiting from it.

Cheers,

Jeff


Re: Memory size

2018-03-11 Thread Jeff Janes
On Sun, Mar 11, 2018 at 5:48 AM, dangal  wrote:

>
> Dear some consultation, I have a base of about 750 GB in size and we are
> having problem of slowness in certain views of the application, so I have
> been seeing it is apparently a memory problem because if I run again the
> view runs fast, the base is in a virtual server with 24 GB of RAM and 8 GB
> of shared buffer, with this information how much would you recommend to put
> a memory in the server
>

There is no way to answer that with the information you provide.

Are the "certain views" run with different supplied parameters on different
executions, or are they run with no parameters or unchanging ones?

How long can you wait between the first run and the second run before the
second run is no longer fast?

Cheers,

Jeff


Re: Memory size

2018-03-11 Thread Jeff Janes
On Sun, Mar 11, 2018 at 10:33 AM, dangal  wrote:

> jeff thank you very much for your time, I tell you, they are the same
> queries
> with the same parameters, I take 3 minutes for example, but I execute it
> and
> it takes me seconds, that's why I suspect it is the shared buffer
> The server had 16 GB and we increased it to 24, but I really do not know if
> it should continue to increase since they are not our own resources, we
> have
> to ask for them and justify them
>

If that is the only query that you have trouble with, it might be easiest
just to set up a cron job to run it periodically just to keep that data set
in cache.  Not very elegant, but it can be effective.
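
A sketch of that cron entry (database, view, and schedule are placeholders):

    # keep the view's data set warm
    */15 * * * *  psql -d mydb -Atc 'SELECT count(*) FROM my_slow_view' > /dev/null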

Cheers,

Jeff


Re: Sort is generating rows

2018-05-31 Thread Jeff Janes
On Thu, May 31, 2018 at 7:22 AM, Nicolas Seinlet 
wrote:

> Hi,
>
> I have a query with a strange query plan.
>
> This query is roughly searching for sales, and convert them with a
> currency rate. As currency rate changes from time to time, table contains
> the currency, the company, the rate, the start date of availability of this
> rate and the end date of availability.
>
> The join is done using :
> left join currency_rate cr on (cr.currency_id = pp.currency_id and
>   cr.company_id = s.company_id and
>   cr.date_start <= coalesce(s.date_order, now()) and
>   (cr.date_end is null or cr.date_end > coalesce(s.date_order, now())))
>
> The tricky part is the date range on the currency rate, which is not an
> equality.
>
> the query plan shows:
> ->  Sort  (cost=120.13..124.22 rows=1637 width=56) (actual
> time=14.300..72084.758 rows=308054684 loops=1)
>   Sort Key: cr.currency_id, cr.company_id
>   Sort Method: quicksort  Memory: 172kB
>   ->  CTE Scan on currency_rate cr
> (cost=0.00..32.74 rows=1637 width=56) (actual time=1.403..13.610 rows=1576
> loops=1)
>
> There's 2 challenging things :
> - planner estimates 1637 rows, and get 300 million lines
> - sorting is generating lines
>

These are both explained by the same thing.  The sort is feeding into a
merge join.  For every row in the other node that has the same value of
the scan keys, the entire section of this sort with those same keys gets
scanned again.  The repeated scanning gets counted in the actual row
count, but isn't counted in the expected row count, or in the actual row
count of the thing feeding into the sort (the CTE).


>
>
For now, the more currency rates, the slowest the query. There's not that
> much currency rates (1k in this case), as you can only have one rate per
> day per currency.
>

If it is only per currency per day, then why is company_id present? In any
case, you might be better off listing the rates per day, rather than as a
range, and then doing an equality join.
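
A sketch of the per-day representation (the rate column name is an
assumption; date_end is treated as exclusive, as in the join above):

    CREATE TABLE currency_rate_daily AS
    SELECT cr.currency_id, cr.company_id, d::date AS rate_date, cr.rate
    FROM currency_rate cr
    CROSS JOIN LATERAL generate_series(
        cr.date_start,
        coalesce(cr.date_end, now()) - interval '1 day',
        interval '1 day') d;

    -- the join condition then becomes a plain equality:
    --   and crd.rate_date = coalesce(s.date_order, now())::date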

Cheers,

Jeff


Re: significant jump in sql statement timing for on server vs a remote connection

2022-04-19 Thread Jeff Janes
On Tue, Apr 19, 2022 at 5:00 PM Sbob  wrote:

>
> However if we move the file to another server in the same network and
> run with a psql -h then it runs for more than 10min.


What is the ping time?  Packet loss? You can't take for granted that the
network is good and fast just because they are on the same LAN.

Cheers,

Jeff


Re: Postgresql TPS Bottleneck

2022-04-20 Thread Jeff Janes
On Wed, Apr 20, 2022 at 5:13 AM  wrote:

>
> The next thing I did was starting two independent Postgres instances on
> the same server and run independent client applications against each of
> them. This resulted in our application getting almost double of the TPS
> compared to running a single instance (from 13k to 23k) - Each Postgres
> instance had about 45k TPS which did not increase (?).
>

How could that be?  Isn't there a one to one correspondence between app
progress and PostgreSQL transactions?  How could one almost double while
the other did not increase?  Anyway, 2x45 does seem like an increase
(smallish) over 65.

Your bottleneck for pgbench may be IPC/context switches.  I noticed that -S
did about 7 times more TPS than the default, and it only makes one round
trip to the database while the default makes 7.

You could package up the different queries made by the default transaction
into one function call, in order to do the same thing but with fewer round
trips to the database. This would be an easy way to see if my theory is
true.  If it is, I don't know what that would mean for your app though, as
we know nothing about its structure.

I have a patch handy (attached) which implements this feature as the
builtin transaction "-b tpcb-func".  If you don't want to recompile
pgbench, you could dissect the patch to reimplement the same thing as a -f
style transaction instead.
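
For reference, a sketch of the -f style version (the function mirrors
pgbench's built-in tpcb-like transaction; names are mine):

    CREATE FUNCTION tpcb_like(p_aid int, p_bid int, p_tid int, p_delta int)
    RETURNS int LANGUAGE plpgsql AS $$
    DECLARE abal int;
    BEGIN
      UPDATE pgbench_accounts SET abalance = abalance + p_delta
          WHERE aid = p_aid RETURNING abalance INTO abal;
      UPDATE pgbench_tellers  SET tbalance = tbalance + p_delta WHERE tid = p_tid;
      UPDATE pgbench_branches SET bbalance = bbalance + p_delta WHERE bid = p_bid;
      INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
          VALUES (p_tid, p_bid, p_aid, p_delta, CURRENT_TIMESTAMP);
      RETURN abal;
    END $$;

    -- tpcb_func.sql; run as: pgbench -n -f tpcb_func.sql ...
    -- \set aid random(1, 100000 * :scale)
    -- \set bid random(1, 1 * :scale)
    -- \set tid random(1, 10 * :scale)
    -- \set delta random(-5000, 5000)
    -- SELECT tpcb_like(:aid, :bid, :tid, :delta);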

Note that packaging it up this way does violate the spirit of the
benchmark, as clearly someone is supposed to look at the results of the
first select before deciding to proceed with the rest of the transaction.
But you don't seem very interested in the spirit of the tpc-b benchmark,
just in using it as a tool to track down a bottleneck.

Cheers,

Jeff


pgbench_function_v13.patch
Description: Binary data


Re: Array of integer indexed nested-loop semi join

2022-04-27 Thread Jeff Janes
On Wed, Apr 27, 2022 at 8:19 AM Mickael van der Beek <
mickael.van.der.b...@gmail.com> wrote:

>
> The last query does not finish after waiting for more than 15 minutes.
> (The temporary view creation is very fast and required due to the same
> query in a CTE greatly reducing performance (by more than 5 min.) due to
> the optimisation barrier I'm guessing.)
>

How much over 15 minutes?  20 minutes doesn't seem that long to wait to get
a likely definitive answer.  But at the least show us the EXPLAIN without
ANALYZE of it; that should take only milliseconds.

And what does it mean for something to take 5 minutes longer than "never
finishes"?

(Also, putting every or every other token on a separate line does not make
it easier to read)

Cheers,

Jeff

>


Re: Window partial fetch optimization

2022-05-04 Thread Jeff Janes
On Tue, May 3, 2022 at 2:11 PM Levi Aul  wrote:

> I have a “temporal table” — a table where there are multiple “versions” of
> entities, with each version having a distinct timestamp:
> CREATE TABLE contract_balance_updates (
> block_id bigint NOT NULL,
> block_signed_at timestamp(0) without time zone NOT NULL,
> contract_address bytea NOT NULL,
> holder_address bytea NOT NULL,
> start_block_height bigint NOT NULL,
> balance numeric NOT NULL
> ) PARTITION BY RANGE (block_signed_at);
>
> -- one for each partition (applied by pg_partman from a template)
> CREATE UNIQUE INDEX contract_balance_updates_pkey
> ON contract_balance_updates(
> holder_address bytea_ops,
> contract_address bytea_ops,
> start_block_height int8_ops DESC
> );
>

How does pg_partman deal with the fact that a unique index on a partitioned
table must contain the partitioning key?

It should be noted that your 3 queries don't return the same thing.  The
last one returns columns holder_address, contract_address, and balance,
while the first returns all columns in the table.  If you were to make the
first query return just the three columns holder_address, contract_address,
and balance and build a suitable index, then you could get it to use an
index-only scan.  This should be similar to (but probably faster than) your
3rd query, without all the kerfuffle of extra scans and dummy syntax.  The
index needed would be:

(holder_address bytea_ops, contract_address bytea_ops, start_block_height,
balance);

Note that in theory it could do a better job of using the index you already
have.  It could compute the row_number using only the data available in the
index, then go fetch the table tuple for just the rows which pass the
row_number filter.  But it just isn't smart enough to do that. (By
separating the WHERE clause from the select list into different queries,
that is essentially what your third query is tricking it into doing)

Cheers,

Jeff


Re: Why is there a Sort after an Index Only Scan?

2022-05-04 Thread Jeff Janes
On Wed, May 4, 2022 at 7:15 PM André Hänsel  wrote:

> Quick(?) question... why is there a Sort node after an Index Only Scan?
> Shouldn't the index already spit out sorted tuples?
>
> CREATE INDEX ON orders_test(shipping_date, order_id);
>
> EXPLAIN ANALYZE SELECT
> FROM orders_test
> WHERE TRUE
> AND shipping_date >= '2022-05-01'
> AND shipping_date <= '2022-05-01'
> ORDER BY order_id
> LIMIT 50;
>

They are sorted by order_id only within sets of the same shipping_date,
which is not good enough.  (It would be good enough if it were smart enough
to know that there is only one possible shipping date to satisfy your weird
range condition.)
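
With a genuine equality the sort should disappear, since the
(shipping_date, order_id) index then returns rows already ordered by
order_id (a sketch):

    EXPLAIN ANALYZE SELECT
    FROM orders_test
    WHERE shipping_date = '2022-05-01'
    ORDER BY order_id
    LIMIT 50;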

Cheers,

Jeff


Re: Array of integer indexed nested-loop semi join

2022-05-22 Thread Jeff Janes
On Fri, May 20, 2022 at 6:42 AM Mickael van der Beek <
mickael.van.der.b...@gmail.com> wrote:

>
> Query:
>
> EXPLAIN (
>>   ANALYZE,
>>   VERBOSE,
>>   COSTS,
>>   BUFFERS,
>>   TIMING
>> )
>> SELECT
>>   fu.w2_page_idxs
>> FROM
>>   fact_users
>> AS fu
>> WHERE
>>   EXISTS (
>> SELECT
>> FROM
>>   (
>> SELECT
>>   ARRAY[idx] AS page_idx
>> FROM
>>   fact_pages
>> WHERE
>>   attribute_idxs && ARRAY[30160]
>> FETCH FIRST 1 ROWS ONLY
>>   )
>> AS fp
>> WHERE
>>   fu.w2_page_idxs && fp.page_idx
>>   )
>> ;
>
>
> Without any surprises, the planner is using a sequential scan on the
> "fact_users" table which is very large instead of using the GIN index set
> on the "w2_page_idxs" column.
>

For me, using the subquery in an expression, instead of the EXISTS, does
get it to use the gin index.  And I think it must give the same results.

SELECT
  fu.w2_page_idxs
FROM  fact_users AS fu
WHERE
  fu.w2_page_idxs && ARRAY[(select idx from fact_pages where
attribute_idxs && ARRAY[3003] FETCH FIRST 1 ROWS ONLY)];

But why are you using intarray?  That is unnecessary here, and by creating
ambiguity about the array operators it might be harmful.

Cheers,

Jeff

>


Re: Array of integer indexed nested-loop semi join

2022-05-23 Thread Jeff Janes
On Mon, May 23, 2022 at 3:57 AM Mickael van der Beek <
mickael.van.der.b...@gmail.com> wrote:

> Hello Jeff,
>
> Sadly, the query you suggested won't work because you are only returning
> the first row of the matching inner query rows.
>

Sure, but the query I replaced did the same thing.  (I thought that was
what you wanted, but I guess that was just to get it to run fast enough to
ever finish--in that case it is probably better to use EXPLAIN without the
ANALYZE so that we can see the plan of the correct query).  To get around
that one-row limit you have to write it somewhat differently, getting rid
of the ARRAY and adding an array_agg():

SELECT fu.*
FROM
  fact_users AS fu
WHERE
  fu.w2_page_idxs && (select array_agg(idx) from fact_pages where
attribute_idxs && ARRAY[201]);

This way of writing it is better, as it still works with the LIMIT 1 but
also works without it.  This still uses the indexes for me, at least when
enable_seqscan is off.


> The INNER JOIN version of the query will return all matching rows but also
> include duplicates:
>

You could just add a DISTINCT to get rid of the duplicates.  Of course that
will also take some time on a large returned data set, but probably less
time than scanning a giant table.  I think this is probably cleaner than
the alternatives.
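
A sketch of that version, using the column names from upthread:

    SELECT DISTINCT fu.w2_page_idxs
    FROM fact_users fu
    JOIN fact_pages fp ON fu.w2_page_idxs && ARRAY[fp.idx]
    WHERE fp.attribute_idxs && ARRAY[201];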


>
> The reason I'm using integer arrays is because it is the only way I have
> found in PostgreSQL to get fast inclusion / exclusion checks on large
> datasets (hundreds of millions of values).
> Did I misunderstand your response?
>

I don't know if you misunderstood.  I meant specifically the intarray
extension.  You can use integer arrays with built-in GIN indexes without
help from the intarray extension.  Maybe you know that already and are just
saying that the extension is even faster than the built-in indexed
operators are and you need that extra speed.

Cheers,

Jeff

>


Re: REINDEXdb performance degrading gradually PG13.4

2022-06-01 Thread Jeff Janes
On Tue, May 31, 2022 at 11:14 AM Praneel Devisetty <
devisettypran...@gmail.com> wrote:

>
> Hi,
>>
>> We are trying to reindex 600k tables in a single database  of size 2.7TB
>> using reindexdb utility in a shell script
>> reindexdb -v -d $dbname -h $hostname -U tkcsowner --concurrently -j
>> $parallel -S $schema
>>
>>
What is the value of $parallel?  Are all the tables in the same schema?


> Initially it was processing 1000 tables per minute. Performance is
>> gradually dropping and now after 24 hr it was processing 90 tables per
>> minute.
>>
>
I can't even get remotely close to 1000 per minute with those options, even
with only 10 single-index tables with all of them being empty.  Are you
sure that isn't 1000 per hour?

Using --concurrently really hits the stats system hard (I'm not sure why).
 Could you just omit that?  If it is running at 1000 per minute or even per
hour, does it really matter if the table is locked for as long as it takes
to reindex?
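
That is, just:

reindexdb -v -d $dbname -h $hostname -U tkcsowner -j $parallel -S $schema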

Cheers,

Jeff


Re: rows selectivity overestimate for @> operator for arrays

2022-06-01 Thread Jeff Janes
On Fri, May 27, 2022 at 12:19 PM Alexey Ermakov <
alexey.erma...@dataegret.com> wrote:

> Hello, please look into following example:
>
> postgres=# create table test_array_selectivity as select
> array[id]::int[] as a from generate_series(1, 10000000) gs(id);
> SELECT 10000000
> postgres=# explain analyze select * from test_array_selectivity where a
> @> array[1];
>   QUERY PLAN
>
> -
>   Seq Scan on test_array_selectivity  (cost=0.00..198531.00 rows=50000
> width=32) (actual time=0.023..2639.029 rows=1 loops=1)
> Filter: (a @> '{1}'::integer[])
> Rows Removed by Filter: 9999999
>   Planning Time: 0.078 ms
>   Execution Time: 2639.038 ms
> (5 rows)
>
>
> for row estimation rows=50000=10000000*0.005 we are using constant
> DEFAULT_CONTAIN_SEL if I'm not mistaken.
> and we're using it unless we have something in most_common_elems (MCE)
> in statistics which in this case is empty.
>
>
This was discussed before at
https://www.postgresql.org/message-id/flat/CAMkU%3D1x2W1gpEP3AQsrSA30uxQk1Sau5VDOLL4LkhWLwrOY8Lw%40mail.gmail.com

My solution was to always store at least one element in the MCE, even if
the sample size was too small to be reliable.  It would still be more
reliable than the alternative fallback assumption.  That patch still
applies and fixes your example, or improves it anyway and to an extent
directly related to the stats target size. (It also still has my bogus code
comments in which I confuse histogram with n_distinct).
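
For reference, the stats target that effect scales with is set per column,
and tops out at 10000:

ALTER TABLE test_array_selectivity ALTER COLUMN a SET STATISTICS 10000;
ANALYZE test_array_selectivity;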

Then some other people proposed more elaborate patches, and I never wrapped
my head around what they were doing differently or why the elaboration was
important.

Since you're willing to dig into the source code and since this is directly
applicable to you, maybe you would be willing to go over to pgsql-hackers
to revive, test, and review these proposals with an eye toward getting them
applied in v16.

I'm not sure if there is a simple fix for this, maybe store and use
> something like n_distinct for elements for selectivity estimation ? or
> perhaps we should store something in MCE list anyway even if frequency
> is low (at least one element) ?
>

n_distinct might be the best solution, but I don't see how it could be
adapted to the general array case.  If it could only work when the vast
majority of arrays had length 1, I think that would be too esoteric to be
accepted.

Cheers,

Jeff


Re: reindex option for tuning load large data

2022-06-18 Thread Jeff Janes
On Fri, Jun 17, 2022 at 1:34 AM James Pang (chaolpan) 
wrote:

> Hi ,
>
>   We plan to migrate large database from Oracle to Postgres(version 13.6,
> OS Redhat8 Enterprise), we are checking options to make data load in
> Postgres fast. Data volume is about several TB,  thousands of indexes,
> many large table with partitions.  We want to make data load running fast
> and avoid miss any indexes when reindexing. There are 2 options about
> reindex. Could you give some suggestions about the 2 options, which option
> is better.
>
>
>
>    1. Create tables and indexes (empty database), update pg_index set
>    indisready=false and indisvalid=false, then load data using COPY from
>    csv, then reindex table …
>
>
Where did this idea come from?  This is likely to destroy your database.


> 2).  Use pg_dump to dump metadata only, then copy the “CREATE INDEX …” sql
>
> Drop indexes before data load
>
>After data load, increase max_parallel_maintenance_workers,
> maintenance_work_mem
>
>Run CREATE INDEX … sql  to leverage parallel create index feature.
>

pg_dump doesn't run against Oracle, so where is the thing you are running
pg_dump against coming from?

If you already have a fleshed out schema in PostgreSQL, you should dump the
sections separately (with --section=pre-data and --section=post-data) to
get the commands to build the objects which should be run before and after
the data is loaded.
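
That is, something along these lines (with $dbname standing in for your
PostgreSQL database):

pg_dump --section=pre-data -f pre.sql $dbname
pg_dump --section=post-data -f post.sql $dbname

Run pre.sql, COPY the data in, then run post.sql to build the indexes and
constraints afterwards.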

Cheers,

Jeff

>


Re: Postgresql 14 partitioning advice

2022-07-27 Thread Jeff Janes
On Wed, Jul 27, 2022 at 8:55 AM Rick Otten  wrote:

>
> One person I talked to said "try not to have more than 100 partitions",
> even with the latest postgresql you'll end up with a lot of lock contention
> if you go over 100 partitions.
>
>
It is hard to know how seriously to take the advice of anonymous people
accompanied with such sparse justification.  Meanwhile, people who actually
wrote the code seem to think that this problem has been mostly overcome
with declarative partitioning in the newer versions.

When you do decide to start removing the oldest data, how will you do it?
Your partitioning should probably be designed to align with this.
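
For example, with monthly range partitions (all names here made up),
retiring the oldest month is a cheap metadata operation rather than a bulk
DELETE:

ALTER TABLE events DETACH PARTITION events_2022_06;
DROP TABLE events_2022_06;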

> Since the data most frequently queried would be recent data (say the past
month or so)

Is this done specifically with a time clause, or just by criteria which
happen to align with time, but have no formal relationship with it?

Cheers,

Jeff


Re: Catching up with performance & PostgreSQL 15

2022-12-03 Thread Jeff Janes
On Tue, Nov 29, 2022 at 4:07 PM David Rowley  wrote:

> On Wed, 30 Nov 2022 at 03:31, Tom Lane  wrote:
> >
> > Alvaro Herrera  writes:
> > > IMO it was a mistake to turn JIT on in the default config, so that's
> one
> > > thing you'll likely want to change.
> >
> > I wouldn't necessarily go quite that far, but I do think that the
> > default cost thresholds for invoking it are enormously too low,
> > or else there are serious bugs in the cost-estimation algorithms
> > for deciding when to use it.  A nearby example[1] of a sub-1-sec
> > partitioned query that took 30sec after JIT was enabled makes me
> > wonder if we're accounting correctly for per-partition JIT costs.
>
> I'm very grateful for JIT. However, I do agree that the costs need to work.
>
> The problem is that the threshold to turn JIT on does not consider how
> many expressions need to be compiled. It's quite different to JIT
> compile a simple one-node plan with a total cost of 10 than to JIT
> compile a plan that  costs the same but queries 1000 partitions. I
> think we should be compiling expressions based on the cost of the
individual node rather than the total cost of the plan.


I think a big win for JIT would be to be able to do it just once per cached
plan, not once per execution.  And then have it turned on only for prepared
statements.  Of course that means JIT couldn't do parameter folding, but I
don't know if it does that anyway.  Also, very expensive plans are
generally dominated by IO cost estimates, and I think it doesn't make sense
to drive JIT decisions based predominantly on the expected cost of the IO.
If the planner separated IO cost estimate totals from CPU cost estimate
totals, it might open up better choices.

Cheers,

Jeff


Re: LIKE CLAUSE on VIEWS

2023-01-22 Thread Jeff Janes
On Sun, Jan 22, 2023 at 6:34 AM aditya desai  wrote:

> Hi,
> Is there any way to improve performance of LIKE clause on VIEWS.
>
> select * From request_vw where upper(status) like '%CAPTURED%' - 28
> seconds.
>

You would need to have an expression index over upper(status) to support
such a query, not an index on status itself.  It would probably be better
to just use ILIKE rather than upper(), so `status ILIKE '%captured%'`,
which can benefit from an index on "status" itself.
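
For a pattern with a wildcard on both ends, the index that can help is a
trigram GIN index on the view's base table (untested, and the base table
name here is a guess):

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON request USING gin (status gin_trgm_ops);

SELECT * FROM request_vw WHERE status ILIKE '%captured%';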

Also, as this is a VIEW, neither TRIGRAM nor normal indexes get used.
>

There is no problem in general using trigram indexes (or any other index
types) on views.  Maybe your view has particular features which inhibit the
use of the index, but you haven't given any information which would be
useful for assessing that.  Did you try an index, or just assume it
wouldn't work without trying?

Cheers,

Jeff

>


Re: Performance of UPDATE operation

2023-02-13 Thread Jeff Janes
On Mon, Feb 13, 2023 at 10:09 AM Mkrtchyan, Tigran 
wrote:

>
>  0.524   0  BEGIN;
>  0.819   0  INSERT INTO t_inodes (inumber, icrtime,
> igeneration)
>  0.962   0  UPDATE t_inodes SET igeneration = igeneration
> + 1 where  inumber = :inumber;
>  9.203   0  END;
> ```
>
> My naive expectation would be that updating the newly inserted record
> should cost nothing


It takes less than 1/10 of the total time.  That is pretty close to
nothing.  Why would you expect it to be truly free?


> ... Are there ways
> to make it less expensive?
>

Obviously here you could just insert the correct value in the first place
and not do the update at all.
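
That is, something like this sketch (the exact values depend on what your
INSERT was setting before):

INSERT INTO t_inodes (inumber, icrtime, igeneration)
VALUES (:inumber, now(), 1);  -- insert the post-update igeneration directly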

Cheers,

Jeff


Re: Connection forcibly closed remote server error.

2023-02-15 Thread Jeff Janes
On Wed, Feb 15, 2023 at 7:13 AM aditya desai  wrote:

> Hi,
> We are getting this error when transferring data using COPY command or
> running workflow for huge data. We are using Password Authentication(LDAP)
>
> "Connection forcibly closed remote server"
>

Are you sure that that is the exact wording? It doesn't sound like grammar
that would be used for an error message.  Or did you perhaps translate it
to English from a localized error message?

Is that error reported by the client, or in the server log?  Whichever end
that is, what does the other end say?

Cheers,

Jeff


Re: Planner choosing nested loop in place of Hashjoin

2023-03-11 Thread Jeff Janes
On Tue, Mar 7, 2023 at 7:14 AM Praneel Devisetty 
wrote:

> Hi,
>
> I have a query which is taking roughly 10mins to complete and the query
> planner is choosing a nested loop.
>
> query and query plan with analyze,verbose,buffers
> qsEn | explain.depesz.com 
>
>
What version is this?  Any chance you can share this without
anonymization?  Not knowing the actual names makes it a lot harder to
understand.  In particular, what is the actual function golf_romeo()? And
five_two()?  And what is the regexp pattern that is bastardized into
'oscar_mike'::text ?


> Disabling the nested loop on session is allowing the query planner to
> choose a better plan and complete it in 2mins.Stats are up to date and
> analyze was performed a few hours ago.
>

A lot can change in a few hours, do another analyze immediately before
gathering the execution plan.  Your row estimates are dreadful, but we
can't really tell why with the info provided.

Cheers,

Jeff


Re: thousands of CachedPlan entry per backend

2023-06-02 Thread Jeff Janes
On Thu, Jun 1, 2023 at 4:51 AM James Pang (chaolpan) 
wrote:

>   2) From this line, we saw a total of 42 blocks, 215 chunks:
> CacheMemoryContext: 8737352 total in 42 blocks; 1021944 free (215 chunks);
> 7715408 used,
>
>   But the sum over its child-level entries shows far more blocks and
> chunks than "CacheMemoryContext" itself; is it expected to see
> that?
>

Yes, that is expected.  The parent context reports only its own direct
memory usage and blocks.  It does not include the sum of memory usage of
its child contexts.

Cheers,

Jeff

>


Re: Plan weirdness. A sort produces more rows than the node beneath it

2023-08-04 Thread Jeff Janes
On Fri, Aug 4, 2023 at 11:00 AM Dane Foster  wrote:

> Hello,
>
> I'm trying to understand a bit of weirdness in a plan output. There is a
> sort node above a sequential scan node where the scan node produces 26,026
> rows yet the sort node above it produces 42,995,408. How is it possible
> to sort more data than you received?
>

This is normal for a merge join.  For every tie in the first input, the
qualifying part of the 2nd input must be rescanned, and the rows are
tallied again (in the sort node) each time they are rescanned.

Cheers,

Jeff

>


Re: slow delete

2023-08-16 Thread Jeff Janes
On Tue, Aug 15, 2023 at 4:23 PM Les  wrote:

> {
>
> "Trigger Name": "RI_ConstraintTrigger_a_75463",
>
> "Constraint Name": "fk_pfq_src_product_file",
>
> "Relation": "product_file",
>
> "Time": 11179.429,
>
> "Calls": 90
>
> },
>
...


> The one with fk_pfft_product looks like this, it has about 5000 records in
> it:
>

That constraint took essentially no time.  You need to look into the one
that took all of the time,
which is fk_pfq_src_product_file.
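
The usual cause for a slow RI trigger like that is a missing index on the
referencing column, so every deleted row forces a scan of the referencing
table.  Something like this is the usual fix (table and column names
guessed from the constraint name):

CREATE INDEX ON product_file_queue (src_product_file);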

Cheers,

Jeff

>


Re: Queries containing ORDER BY and LIMIT started to work slowly

2023-08-29 Thread Jeff Janes
On Tue, Aug 29, 2023 at 1:47 PM Rondat Flyag  wrote:

> I have a legacy system that uses `Posgresql 9.6` and `Ubuntu 16.04`.
> Everything was fine several days ago even with standard Postgresql
> settings. I dumped a database with the compression option (maximum
> compression level -Z 9) in order to have a smaller size (`pg_dump
> --compress=9 database_name > database_name.sql`). After that I got a lot of
> problems.
>

You describe taking a dump of the database, but don't describe doing
anything with it.  Did you replace your system with one restored from that
dump?  If so, did you vacuum and analyze afterwards?

Cheers,

Jeff


Re: Index bloat and REINDEX/VACUUM optimization for partial index

2023-08-29 Thread Jeff Janes
On Mon, Aug 28, 2023 at 8:33 PM jayaprabhakar k 
wrote:

> Hi,
>
> TL;DR:
> Observations:
>
>1. REINDEX requires a full table scan
>   - Roughly create a new index, rename index, drop old index.
>   - REINDEX is not incremental. running reindex frequently does not
>   reduce the future reindex time.
>2. REINDEX does not use the index itself
>3. VACUUM does not clean up the indices. (relpages >> reltuples) I
>understand, vacuum is supposed to remove pages only if there are no live
>tuples in the page, but somehow, even immediately after vacuum, I see
>relpages significantly greater than reltuples. I would have assumed,
>relpages <= reltuples
>4. Query Planner does not consider index bloat, so uses highly bloated
>partial index that is terribly slow over other index
>
Your points 3 and 4 are not correct.  Empty index pages are put on a
freelist for future reuse; they are not physically removed from the
underlying index files.  Maybe they are not actually getting put on the
freelist or not being reused from the freelist for some reason, but that
would be a different issue.  Use the extension pgstattuple to see what its
function pgstatindex says about the index.

The planner does take index bloat into consideration, but its effect size
is low.  Which it should be, as empty or irrelevant pages should be
efficiently skipped during the course of most index operations. To figure
out what is going with your queries, you should do an EXPLAIN (ANALYZE,
BUFFERS) of them, but with it being slow and with it being fast.


> Question: Is there a way to optimize postgres vacuum/reindex when using
> partial indexes?
>

Without knowing what is actually going wrong, I can only offer
generalities.  Make sure you don't have long-lived transactions which
prevent efficient clean up.  Increase the frequency on which vacuum runs on
the table.  It can't reduce the size of an already bloated index, but by
keeping the freelist stocked it should be able prevent it from getting
bloated in the first place.  Also, it can remove empty pages from being
linked into the index tree structure, which means they won't need to be
scanned even though they are still in the file.  It can also free up space
inside non-empty pages for future reuse within that same page, and so that
index tuples don't need to be chased down in the table only to be found to
be not visible.


> ```
> SELECT [columns list]
>   FROM tasks
>   WHERE status NOT IN (3,4,5) AND created > NOW() - INTERVAL '30 days' AND
> updated < NOW() - interval '30 minutes'
> ```
>
> Since we are only interested in the pending tasks, I created a partial
> index
>  `*"tasks_pending_status_created_type_idx" btree (status, created,
> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>

This looks like a poorly designed index.  Since the status condition
exactly matches the index where clause, there is no residual point in
having "status" be the first column in the index, it can only get in the
way (for this particular query).  Move it to the end, or remove it
altogether.
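
That is, something more like this, here with "status" removed entirely
(untested):

CREATE INDEX tasks_pending_created_type_idx
  ON tasks (created, task_type)
  WHERE status <> ALL (ARRAY[3, 4, 5]);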

Within the tuples which pass the status check, which inequality is more
selective, the "created" one or "updated" one?

Cheers,

Jeff


Re: Queries containing ORDER BY and LIMIT started to work slowly

2023-08-29 Thread Jeff Janes
On Tue, Aug 29, 2023 at 2:55 PM Rondat Flyag  wrote:

> I took the dump just to store it on another storage (external HDD). I
> didn't do anything with it.
>

I don't see how that could cause the problem, it is probably just a
coincidence.  Maybe taking the dump held a long-lived snapshot open which
caused some bloat.   But if that was enough to push your system over the
edge, it was probably too close to the edge to start with.

Do you have a plan for the query while it was fast?  If not, maybe you can
force it back to the old plan by setting enable_seqscan=off or perhaps
enable_sort=off, to let you capture the old plan for comparison.

The estimate for the seq scan of  isbns_statistics is off by almost a
factor of 2.  A seq scan with no filters and which can not stop early
should not be hard to estimate accurately, so this suggests autovac is not
keeping up.  VACUUM ANALYZE all of the involved tables and see if that
fixes things.

Cheers,

Jeff


Re: Queries containing ORDER BY and LIMIT started to work slowly

2023-08-31 Thread Jeff Janes
On Wed, Aug 30, 2023 at 1:31 PM Rondat Flyag  wrote:

> Hi and thank you for the response.
>
> I tried VACUUM ANALYZE for three tables, but without success. I also tried
> to set enable_seqscan=off and the query took even more time. If I set
> enable_sort=off then the query takes a lot of time and I cancel it.
>

Maybe you could restore (to a temp server, not the production) a physical
backup taken from before the change happened, and get an old plan that
way.  I'm guessing that somehow an index got dropped around the same time
you took the dump.  That might be a lot of work, and maybe it would just be
easier to optimize the current query while ignoring the past.  But you
seem to be interested in a root-cause analysis, and I don't see any other
way to do one of those.

What I would expect to be the winning plan would be something sort-free
like:

Limit
  merge join
    index scan yielding books in asin order (already being done)
    nested loop
      index scan yielding asins in value order
      index scan probing asins_statistics driven by asins_statistics.asin_id = asins.id

Or possibly a 2nd nested loop rather than the merge join just below the
limit, but with the rest the same

In addition to the "books" index already evident in your current plan, you
would also need an index leading with asins_statistics.asin_id, and one
leading with asins.value.  But if all those indexes exists, it is hard to
see why setting enable_seqscan=off wouldn't have forced them to be used.

 Cheers,

Jeff


Re: Index bloat and REINDEX/VACUUM optimization for partial index

2023-08-31 Thread Jeff Janes
On Wed, Aug 30, 2023 at 8:43 PM jayaprabhakar k 
wrote:

>
>
> On Tue, Aug 29, 2023, 12:43 PM Jeff Janes  wrote:
>
>> On Mon, Aug 28, 2023 at 8:33 PM jayaprabhakar k 
>> wrote:
>>
>>>
>>> Since we are only interested in the pending tasks, I created a partial
>>> index
>>>  `*"tasks_pending_status_created_type_idx" btree (status, created,
>>> task_type) WHERE status <> ALL (ARRAY[3, 4, 5])*`.
>>>
>>
>> This looks like a poorly designed index.  Since the status condition
>> exactly matches the index where clause, there is no residual point in
>> having "status" be the first column in the index, it can only get in the
>> way (for this particular query).  Move it to the end, or remove it
>> altogether.
>>
> Interesting. I don't understand why it will get in the way. Unfortunately
> we have a few other cases where status is used in filter. That said, I will
> consider how to get this to work.
> Would removing status from the index column, improve HOT updates %? For
> example, changing status from 1->2, doesn't change anything on the index
> (assuming other criteria for HOT updates are met), but I am not sure how
> the implementation is.
>

No, changes to the status column will not qualify as HOT updates, even if
status is only in the WHERE clause and not the index body.  I don't know if
there is a fundamental reason that those can't be done as HOT, or if it is
just an optimization that no one implemented.


>
>
>> Within the tuples which pass the status check, which inequality is more
>> selective, the "created" one or "updated" one?
>>
> Obviously updated time is more selective (after status), and the created
> time is included only to exclude some bugs in our system that had left some
> old tasks stuck in progress (and for sorting). We do try to clean
> up occasionally, but not each time.
>

If "created" were the leading column in the index, then it could jump
directly to the part of the index which meets the `created > ...` without
having to scroll through all of them and throw them out one by one.  But it
sounds like there are so few of them that being able to skip them wouldn't
be worth very much.


>
> However we cannot add an index on `updated` column because that timestamp
> gets updated over 10x on average for each task. Since if a single index
> uses a column, then the update will not be HOT, and every index needs to be
> updated. That will clearly add bloat to every index. Did I miss something?
>

Why does it get updated so much?  It seems like status should go from 1 to
2, then from 2 to 3,4,or 5, and then be done.  So only 2 updates, not 10.
Maybe the feature which needs this frequent update could be done in some
other way which is less disruptive.

But anyway, PostgreSQL has features to prevent the index bloat from
becoming too severe of a problem, and you should figure out why they are
not working for you.  The most common ones I know of are 1) long open
snapshots preventing clean up, 2) all index scans being bitmap index scans,
which don't do micro-vacuuming/index hinting the way ordinary btree
index scans do, and 3) running the queries on a hot-standby, where index
hint bits must be ignored.  If you could identify and solve this issue,
then you wouldn't need to twist yourself into knots avoiding non-HOT
updates.

Cheers,

Jeff

>


Re: Index bloat and REINDEX/VACUUM optimization for partial index

2023-08-31 Thread Jeff Janes
On Thu, Aug 31, 2023 at 11:06 AM Maxim Boguk  wrote:


> With the default value of autovacuum_vacuum_scale_factor (The default is
> 0.2 (20% of table size).) index will collect like 100M outdated/dead index
> entries before autovacuum kicks in and cleans them all (in a worst case),
> and of course it will lead to huge index bloat and awful performance.
>

Index bloat doesn't automatically lead to awful performance.  There must be
some additional factor at play.


> Even if you scale down autovacuum_vacuum_scale_factor to some
> unreasonable low value like 0.01, the index still bloats to the 5M dead
> entries before autovacuum run, and constant vacuuming of a huge 500M table
> will put a huge load on the database server.
>

For this type of situation, I would generally set
autovacuum_vacuum_scale_factor to 0, and use autovacuum_vacuum_threshold to
drive the vacuuming instead.  But I'd make those changes just on the queue
table(s), not system wide.  Due to the visibility map, the load on the
server does not need to be huge just due to the table, as the stable part
of the table can be ignored.  The problem is that each index still needs to
be read entirely for each vacuum cycle, which would not be much of a
problem for the partial indexes, but certainly could be for the full
indexes.  There are some very recent improvements in this area, but I don't
think they can be applied selectively to specific indexes.
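
Something like this, with the threshold picked to suit your churn rate (the
number here is made up):

ALTER TABLE tasks SET (
  autovacuum_vacuum_scale_factor = 0,
  autovacuum_vacuum_threshold = 100000
);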



>
> Unfortunately there is no easy way out of this situation from database
> side, in general I recommend not trying to implement a fast pacing queue
> like load inside of a huge and constantly growing table, it never works
> well because you cannot keep up partial efficient indexes for the queue in
> a clean/non-bloated state.
>
> In my opinion the best solution is to keep the list of entries to process
> ("around 1000-1500 tasks in pending statuses") duplicated in a separate
> tiny table (via triggers, or implemented at the application level); in that
> case autovacuum will be able to quickly clean dead entries from the index.
>

You should be able to use declarative partitioning to separate the "final"
tuples from the "active" tuples, to get the same benefit but with less work.

Cheers,

Jeff


Re: Dirty reads on index scan,

2023-09-24 Thread Jeff Janes
On Fri, Sep 22, 2023 at 5:44 AM Koen De Groote  wrote:

> Alright.
>
> So, if I want to speed up the query, apart from trying to vacuum it
> beforehand, I suspect I've hit the limit of what this query can do?
>

It is more a limit on the system as a whole, not just one query.  How is
this table being inserted?  updated?  deleted? Is the physical row order
correlated on the insert_timestamp column (look at pg_stats.correlation)?
If not, why not? (Based on the name of the column, i would expect it to be
highly correlated)

Did you try the VACUUM and if so did it work?  Knowing that might help us
figure out what else might work, even if you don't want to do the vacuum.
But why not just do the vacuum?

You should show us the actual plans, not just selected excerpts from it.
There might be clues there that you haven't excerpted.  Turn on
track_io_timing first if it is not on already.


> Because, the table is just going to keep growing. And it's a usually a
> query that runs one time per day, so it's a cold run each time.
>

Why do you care if a query run once per day takes 1 minute to run?


> Is this just going to get slower and slower and there's nothing that can
> be done about it?
>

It is probably not so much the size of the data (given that it is already
far too large to stay in cache) as the number of dead tuples it had to wade
through.  Having to read 16571 pages just to find 1000 tuples from a
single-loop index scan suggests you have a lot of dead tuples.  Like, 16
for every live tuple.  Why do you have so many, and why isn't index
micro-vacuuming cleaning them up?  Do you have long-running transactions
which are preventing clean up?  Are you running this on a standby?

Cheers,

Jeff


Re: GIN JSONB path index is not always used

2023-10-17 Thread Jeff Janes
On Tue, Oct 17, 2023 at 10:09 AM Tomasz Szymański  wrote:

> - Database version: 11.18

That is pretty old.  It is 3 bug-fix releases out of date even for its
major version, and the major version itself is just about to reach EOL and
is missing relevant improvements.

- Plan when it uses an index
> "Total Cost": 1165.26,
> - Plan when it doesn't use an index
> "Total Cost": 1184.3,
>

The JSON format for plans is pretty non-ideal for human inspection;
especially so once you include ANALYZE and BUFFERS, which you should do.
Please use the plain text format instead.  But I can see that the plans are
very similar in cost, so it wouldn't take much to shift between them.
Should we assume that not using the index is much slower (otherwise, why
would you be asking the question?)?



> - It seems maybe the index can't keep up(?) because of this heavy insertion
> SELECT * FROM pgstatginindex('user_p_meta_jsonb_path_idx');
>  version | pending_pages | pending_tuples
> -+---+
>2 |98 |  28807
> (1 row)
> Might it be the case that it clogs up and cannot use the index when
> reading?
>

Definitely possible.  The planner does take those numbers into account when
planning.  The easiest thing would be to just turn off fastupdate for those
indexes.  That might make the INSERTs somewhat slower (it is hard to
predict how much and you haven't complained about the performance of the
INSERTs anyway) but should make the SELECTs more predictable and generally
faster.  I habitually turn fastupdate off and then turn it back on only if
I have an identifiable cause to do so.

If you don't want to turn fastupdate off, you could instead change the
table's autovac parameters to be more aggressive (particularly
n_ins_since_vacuum, except that that doesn't exist until v13), or have a
cron job call gin_clean_pending_list periodically.
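
Concretely, those two options look like this:

ALTER INDEX user_p_meta_jsonb_path_idx SET (fastupdate = off);
SELECT gin_clean_pending_list('user_p_meta_jsonb_path_idx');  -- flush what is already pending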


- Last autovacuum for some reason happened 4 days ago
>
> n_live_tup  | 4591412
> n_dead_tup  | 370828
>

Based on those numbers and default parameters, there is no reason for it to
be running any sooner.  That reflects only 8% turnover while the default
factor is 20%.

Cheers,

Jeff


Re: GIN JSONB path index is not always used

2023-10-23 Thread Jeff Janes
On Mon, Oct 23, 2023 at 6:33 AM Tomasz Szymański  wrote:


>  Limit  (cost=0.00..1184.30 rows=21 width=4) (actual
> time=1567.136..1619.956 rows=1 loops=1)
>->  Seq Scan on account_user  (cost=0.00..256768.27 rows=4553 width=4)
> (actual time=1567.135..1619.953 rows=1 loops=1)
>


It thinks the seq scan will stop 99.5% early, after finding 21 out of 4553
qualifying tuples.  But instead it has to read the entire table to actually
find only 1.

The selectivity estimate of the @> operator has been substantially improved
in v13.  It is still far from perfect, but should be good enough to solve
this problem for this case and most similar cases.  Turning off fastupdate
on the index would probably also solve the problem, for a different reason.

Cheers,

Jeff


Re: [EXTERNAL] Re: Performance down with JDBC 42

2023-11-05 Thread Jeff Janes
On Sun, Nov 5, 2023 at 11:20 AM Abraham, Danny 
wrote:

> Thanks Laurenz,
>
> Traced two huge plans. They differ.
> The fast one does use Materialize and Memoize  (the psql).
> Is there something in JDBC 42 that blocks these algorithms?


Directly blocking those is not likely. Maybe the way the drivers fetch
partial results is different, such that with one the planner knows to
expect only partial results to be fetched and with the other it does not.
So in one case it chooses the fast-start plan, and in the other it
doesn't.  But it will be hard to get anywhere if you just dribble
information at us a bit at a time.  Can you come up with a self-contained
test case?  Or at least show the entirety of both plans?

Cheers,

Jeff


Re: Strange "actual time" in simple CTE

2023-12-03 Thread Jeff Janes
On Sat, Dec 2, 2023 at 11:50 AM Jean-Christophe Boggio <
postgre...@thefreecat.org> wrote:

> Hello,
>
> I am trying to optimize a complex query and while doing some explains, I
> stumbled upon this :
>
>CTE cfg
>  ->  Result  (cost=2.02..2.03 rows=1 width=25) (actual
> time=7167.478..7167.481 rows=1 loops=1)
> ...
> How can this take 7 seconds?
>



> This really looks like an artefact (maybe in relation to the JIT compiler?)
>
>
Exactly.  The time taken to do the JIT compilations gets measured in
non-intuitive places in the plan.  I'm guessing that that is what is going
on here, especially since the time separately reported at the end of the
plan for JIT so closely matches this mysterious time.  Just turn JIT off; I
doubt it is doing you any good anyway.
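
That is:

SET jit = off;               -- just this session
ALTER SYSTEM SET jit = off;  -- cluster-wide, then SELECT pg_reload_conf();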

Cheers,

Jeff


Re: Parallel hints in PostgreSQL with consistent perfromance

2023-12-27 Thread Jeff Janes
On Wed, Dec 27, 2023 at 8:15 AM mohini mane 
wrote:

> Hello Team,
> I observed that increasing the degree of parallel hint in the SELECT
> query did not show performance improvements.
> Below are the details of sample execution with EXPLAIN ANALYZE
>

PostgreSQL doesn't have hints, unless you are using pg_hint_plan. Which you
should say if you are.

*Output:*
> PSQL query execution with hints 6 for 1st time => 203505.402 ms
> PSQL query execution with hints 6 for 2nd time => 27920.272 ms
> PSQL query execution with hints 6 for 3rd time => 27666.770 ms
> Only 6 workers launched, and there is no reduction in execution time even
> after increasing the degree of parallel hints in select query.
>

All you are showing here is the effect of caching the data in memory.  You
allude to changing the degree, but didn't show any results, or even
describe what the change was.  Is 6 the base from which you increased, or
is it the result of having done the increase?

Cheers,

Jeff


Re: Slow GroupAggregate and Sort

2023-12-28 Thread Jeff Janes
On Thu, Dec 28, 2023 at 12:03 PM Darwin Correa  wrote:

>
> when running the query (query.sql), as you can see in the explain
> (plan4_v3.txt), citus takes about 18s to run all fragments
>

Where is plan4_v3.txt?  Is that hidden in some non-obvious way in one of
your links?


> but each fragment takes at most 2s, so my question is: why does citus take
> this time to run all fragments?
>

I only see that one arbitrary fragment takes 2.7s, with no indication
whether that one is the slowest one or not.  But I am not used to reading
citus plans.


> also we removed partitions and tested only with citus, but the query took
> more than a minute.
> as a note, we do not have 72 shards on the same node; we have 72 in total,
> 24 shards on each node.
>

I thought the point of sharding was to bring more CPU and RAM to bear than
can feasibly be obtained in one machine.  Doesn't that make 24 shards per
machine completely nuts?


>
> I think the problem was in Sort and in GroupAggregate  I no have idea how
> speed up this in master node, because the Custom Scan (Citus Adaptive)  is
> not too slow, the most time is consumed in master on Sort and group
>

You want to know why citus is so slow here, but also say it isn't slow and
something else is slow instead?

I'd break this down into more manageable chunks for investigation.
Populate one scratch table (on one node, not a hypertable) with all 2.6
million rows.  See how long it takes to populate it based on the citus
query, and separately see how long it takes to run the aggregate query on
the populated scratch table.

What version of PostgreSQL (and citus) are you using?  In my hands (without
citus being involved), the sort includes "users" as the last column, to
support the count(distinct users) operation.  I don't know why yours
doesn't do that.

Cheers,

Jeff


Re: Parallel hints in PostgreSQL with consistent perfromance

2023-12-28 Thread Jeff Janes
On Thu, Dec 28, 2023 at 7:47 AM mohini mane 
wrote:

> Thank you for your response !!
> I am experimenting with SQL query performance for SELECT queries on large
> tables and I observed that changing/increasing the degree of parallel hint
> doesn't give the expected performance improvement.
>

But you still haven't addressed the fact that PostgreSQL *does not have
planner hints*.

Are you using some nonstandard extension, or nonstandard fork?


> I have executed the SELECT query with 2,4 & 6 parallel degree however
> every time only 4 workers launched & there was a slight increase in
> Execution time as well,
>

Adding an ignored comment to your SQL would not be expected to do
anything.  So it is not surprising that it does not do anything about
the number of workers launched.  It is just a comment.  A note to the human
who is reading the code.


> why there is an increase in execution time with parallel degree 6 as
> compared to 2 or 4?
>

Those small changes seem to be perfectly compatible with random noise.  You
would need to repeat them dozens of times in random order, and then do a
statistical test to convince me otherwise.


>


Re: Questions about "Output" in EXPLAIN ANALYZE VERBOSE

2024-01-02 Thread Jeff Janes
On Tue, Jan 2, 2024 at 1:29 PM Jerry Brenner  wrote:

> We are currently on 13.9.
>

Why not just use the latest minor release, 13.13?  For security reasons,
that is the only minor release of v13 you should be using anyway.  I think
it is a bit much to hope that people will spend their time for free
researching obsolete minor releases.


> *Any node type accessing an index or table*
>
>- It looks like "Output" includes more than just the columns with
>predicates and/or being accessed or returned in other nodes.
>
Not in my hands. For SELECTs it just lists the columns that are needed.
Your example is hard to follow because it appears to be just snippets of a
plan, with no example of the query to which it belongs.

Cheers,

Jeff


Re: Parallel hints in PostgreSQL with consistent perfromance

2024-01-03 Thread Jeff Janes
> *1st time query executed with PARALLEL DEGREE 2 *
> explain analyze select /*+* PARALLEL(A 2)* */ * from
> test_compare_all_col_src1 A;
>  QUERY
> PLAN
>
> 
>  Gather  (cost=10.00..45524.73 rows=949636 width=97) (actual
> time=0.673..173.017 rows=955000 loops=1)
>Workers Planned: 4
>   * Workers Launched: 4*
>->  Parallel Seq Scan on test_compare_all_col_src1 a
>  (cost=0.00..44565.09 rows=237409 width=97) (actual time=0.039..51.941
> rows=191000 loops=5)
>  Planning Time: 0.093 ms
> * Execution Time: 209.745 ms*
> (6 rows)
>

Your alias is not enclosed in double quotes, so it is downcased to "a" (as
can be seen from the alias printed in the plan).  But pg_hint_plan hints
don't follow the downcasing convention, so the hint on "A" does not match
the alias "a", and so is ignored.
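
So either double-quote the alias in the query, or write the hint with the
lower-case alias (note also dropping the stray asterisks from the hint
comment):

explain analyze select /*+ Parallel(a 2) */ * from test_compare_all_col_src1 A;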

Cheers,

Jeff

>


Re: Slow GroupAggregate and Sort

2024-01-03 Thread Jeff Janes
On Mon, Jan 1, 2024 at 9:57 AM Darwin Correa  wrote:

> Hello, Happy New Year! I add my responses in blue.
>
>
>
>  On Thu, 28 Dec 2023 13:06:18 -0500, *Jeff Janes
> >* wrote 
>
> I thought the point of sharding was to bring more CPU and RAM to bear than
> can feasibly be obtained in one machine.  Doesn't that make 24 shards per
> machine completely nuts?
>
>
> Based on the citus docs the recommended shard count is 2x cpu cores; in my
> case I've tested with few shards and 1:1, 2:1 shards but always have slow
> query time in the last step (sorting and grouping) on the master node.
>

That might make sense if PostgreSQL didn't do parallelization itself.  But
according to your plan, PostgreSQL itself tries to parallelize 4 ways
(although fails, as it can't find any available workers) and then you have
24 nodes all doing the same thing, all with only 12 CPU.  That doesn't seem
good, although it now does seem unrelated to the issue at hand.


> I'd break this down into more manageable chunks for investigation.
> Populate one scratch table (on one node, not a hypertable) with all 2.6
> million rows.  See how long it takes to populate it based on the citus
> query, and separately see how long it takes to run the aggregate query on
> the populated scratch table.
>
>
> After the scratch table was filled, the sort took 32s; explain (
> https://explain.dalibo.com/plan/8a3h26hcc6328c11)
>

So that plan shows the sort to be egregiously slow, and with no involvement
of citus and no apparent reason for slowness.  I'm thinking you have a
pathological collation being used.  What is your default collation?  (Your
DDL shows that no non-default collations are in use, but doesn't indicate
what the default is)
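
You can check it with:

SELECT datcollate, datctype FROM pg_database WHERE datname = current_database();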

Cheers,

Jeff


Re: Slow query when pg_trgm is in inner lopp

2018-06-20 Thread Jeff Janes
On Wed, Jun 20, 2018 at 9:21 AM, Sasa Vilic  wrote:


> Query that we have finds all routes between two set of points. A set is a
> dynamically/loosely defined by pattern given by the user input. So for
> example
> if user wants to find all routes between international airports in Austria
> toward London Heathrow, he or she would use 'LOW%' as
> :from_point_identifier
> and 'EGLL' as :to_point_identifier. Please keep in mind that is a simple
> case,
> and that user is allowed to define search term any way he/she see it fit,
> i.e. '%OW%', 'EG%'.
>


Letting users do substring searches on airport codes in the middle of a
complex query makes no sense.  Do all airports with 'OW' in the middle of
them have something in common with each other?  If people can't remember
the real airport code of the airport they are using, you should offer a
look-up tool which they can use to figure that out **before** hitting the
main query.

But taking for granted your weird use case, the most obvious improvement to
the PostgreSQL code that I can see is in the executor, not the planner.
There is no reason to recompute the bitmap on idx_point_08 each time
through the nested loop, as the outcome of that scan doesn't depend on the
outer tuple.  Presumably the reason this happens is that it is being
'BitmapAnd'ed with another bitmap index scan which does depend on the outer
tuple, and it is just not smart enough to reuse the stable bitmap while
recomputing the parameterized one.

Cheers,

Jeff


Re: Improving Performance of Query ~ Filter by A, Sort by B

2018-07-14 Thread Jeff Janes
On Tue, Jul 10, 2018 at 11:07 AM, Lincoln Swaine-Moore <
lswainemo...@gmail.com> wrote:

>
>
>
> Something about the estimated row counts (this problem persisted after I
> tried ANALYZEing)
>

What is your default_statistics_target?  What can you tell us about the
distribution of parent_id?  (exponential, power law, etc?).  Can you show
the results for select * from pg_stats where tablename='a' and
attname='parent_id' \x\g\x  ?


> forces usage of the tmstmp index and Merge Append (which seems wise) but
> also a filter condition on parent_id over an index condition, which is
> apparently prohibitively slow.
>
> I tried creating a multicolumn index like:
>
> CREATE INDEX "a_partition1_parent_and_tmstmp_idx" on "a_partition2" USING
> btree ("parent_id", "tmstmp" DESC);
>
> But this didn't help (it wasn't used).
>

You could try reversing the order and adding a column to be (tmstmp,
parent_id, id) and keeping the table well vacuumed.  This would allow the
slow plan to still walk the indexes in tmstmp order but do it as an
index-only scan, so it could omit the extra trip to the table. That trip to
the table must be awfully slow to explain the numbers you show later in the
thread.

...


> This query plan (which is the same as when LIMIT is removed) has been a
> good short term solution when the number of "parent_id"s I'm using is still
> relatively small, but unfortunately queries grow untenably slow as the
> number of "parent_id"s involved increases:
>

What happens when you remove that extra order by phrase that you added?
The original slow plan should become much faster when the number of
parent_ids is large (it has to dig through fewer index entries before
accumulating 20 good ones), so you should try going back to that.

...


> I'd be very grateful for help with one or both of these questions:
> 1) Why is adding an unnecessary (from the perspective of result
> correctness) ORDER BY valuable for forcing the parent_id index usage, and
> does that indicate that there is something wrong with my
> table/indexes/statistics/etc.?
>

It finds the indexes on tmstmp to be falsely attractive, as it can walk in
tmstmp order and so avoid the sort. (Really not the sort itself, but the
fact that sort has to first read every row to be sorted, while walking an
index can abort once the LIMIT is satisfied).  Adding an extra phrase to
the ORDER BY means the index is no longer capable of delivering rows in the
needed order, so it no longer looks falsely attractive.  The same thing
could be obtained by doing a dummy operation, such as ORDER BY tmstmp + '0
seconds' DESC.  I prefer that method, as it is more obviously a tuning
trick.  Adding in "id" looks more like a legitimate desire to break any
ties that might occasionally occur in tmstmp.

As Tom pointed out, there clearly is something wrong with your statistics,
although we don't know what is causing it to go wrong.  Fixing the
statistics isn't guaranteed to fix the problem, but it would be a good
start.




> 2) Is there any way I can improve my query time when there are many
> "parent_id"s involved? I seem to only be able to get the query plan to use
> at most one of the parent_id index and the tmstmp index at a time. Perhaps
> the correct multicolumn index would help?
>

A few things mentioned above might help.

But if they don't, is there any chance you could redesign your partitioning
so that all parent_id queries together will always be in the same
partition?  And if not, could you just get rid of the partitioning
altogether?  1e7 rows is not all that many and doesn't generally need
partitioning.  Unless it is serving a specific purpose, it is probably
costing you more than you are getting.

Finally, could you rewrite it as a join to a VALUES list, rather than as an
in-list?

Cheers,

Jeff


Re: Why HDD performance is better than SSD in this case

2018-07-17 Thread Jeff Janes
On Tue, Jul 17, 2018 at 1:00 AM, Neto pr  wrote:

> Dear,
> Some of you can help me understand this.
>
> This query plan is executed in the query below (query 9 of TPC-H
> Benchmark, with scale 40, database with approximately 40 gb).
>
> The experiment consisted of running the query on a HDD (Raid zero).
> Then the same query is executed on an SSD (Raid Zero).
>
> Why did the HDD (7200 rpm)  perform better?
> HDD - TIME 9 MINUTES
> SSD - TIME 15 MINUTES
>
> As far as I know, the SSD has read speeds 300 times faster than the HDD.
>

Is the 300 times faster comparing random to random, or sequential to
sequential?  Maybe your SSD simply fails to perform as advertised.  This
would not surprise me at all.

To remove some confounding variables, can you turn off parallelism and
repeat the queries?  (Yes, they will probably get slower.  But are the
relative timings still the same?)  Also, turn on track_io_timing and
repeat the "EXPLAIN (ANALYZE, BUFFERS)", perhaps with TIMING OFF.

Also, see how long it takes to read the entire database, or just the
largest table, outside of postgres.

Something like:

time tar -f - $PGDATA/base | wc -c

or

time cat $PGDATA/base//* | wc -c

Cheers,

Jeff


Re: Improving Performance of Query ~ Filter by A, Sort by B

2018-07-17 Thread Jeff Janes
On Mon, Jul 16, 2018 at 5:29 PM, Lincoln Swaine-Moore <
lswainemo...@gmail.com> wrote:

> Tom and Jeff,
>
> Thanks very much for the suggestions!
>
> Here's what I've found so far after playing around for a few more days:
>
> What is your default_statistics_target?  What can you tell us about the
>> distribution of parent_id?  (exponential, power law, etc?).  Can you show
>> the results for select * from pg_stats where tablename='a' and
>> attname='parent_id' \x\g\x  ?
>
>
> The default_statistics_target is 500, which I agree seems quite
> insufficient for these purposes. I bumped this up to 2000, and saw some
> improvement in the row count estimation, but pretty much the same query
> plans. Unfortunately the distribution of counts is not intended to be
> correlated to parent_id, which is one reason I imagine the histograms might
> not be particularly effective unless there's one bucket for every value.
> Here is the output you requested:
>
> select * from pg_stats where tablename='a' and attname='parent_id';
>
> schemaname | public
> tablename  | a
> attname| parent_id
> inherited  | t
> null_frac  | 0
> avg_width  | 4
> n_distinct | 18871
> most_common_vals   | {15503,49787,49786,24595,49784,17549, ...} (2000
> values)
> most_common_freqs  | {0.0252983,0.02435,0.0241317,
> 0.02329,0.019095,0.0103967,0.00758833,0.004245, ...} (2000 values)
>

You showed the 8 most common frequencies.  But could you also show the last
couple of them?  When your queried parent_id value is not on the MCV list,
it is the frequency of the least frequent one on the list which puts an
upper limit on how frequent the one you queried for can be.



> A few questions re: statistics:
>  1) would it be helpful to bump column statistics to, say, 20k (the number
> of distinct values of parent_id)?
>

Only one way to find out...
However you can only go up to 10k, not 20k.
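
That is:

ALTER TABLE a ALTER COLUMN parent_id SET STATISTICS 10000;
ANALYZE a;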



>  2) is the discrepancy between the statistics on the parent and child
> table be expected? certainly I would think that the statistics would be
> different, but I would've imagined they would have histograms of the same
> size given the settings being the same.
>

Is the n_distinct estimate accurate for the partition?  There is an
algorithm (which will change in v11) to stop the MCV from filling the
entire statistics target size if it thinks adding more won't be useful.
But I don't know why the histogram boundary list would be short.  But, I
doubt that that is very important here.  The histogram is only used for
inequality/range, not for equality/set membership.



>  3) is there a way to manually specify the the distribution of rows to be
> even? that is, set the frequency of each value to be ~ n_rows/n_distinct.
> This isn't quite accurate, but is a reasonable assumption about the
> distribution, and might generate better query plans.
>


This would be going in the wrong direction.  Your queries seem to
preferentially use rare parent_ids, not typical parent_ids.  In fact, it
seems like many of your hard-coded parent_ids don't exist in the table at
all.  That certainly isn't going to help the planner any.  Could you
somehow remove those before constructing the query?

You might also take a step back, where is that list of parent_ids coming
from in the first place, and why couldn't you convert the list of literals
into a query that returns that list naturally?
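
Something like this sketch (ids made up):

SELECT *
FROM a
JOIN (VALUES (15503), (49787), (24595)) AS v(parent_id)
  ON a.parent_id = v.parent_id
ORDER BY a.tmstmp DESC
LIMIT 20;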


> You could try reversing the order and adding a column to be (tmstmp,
>> parent_id, id) and keeping the table well vacuumed.  This would allow the
>> slow plan to still walk the indexes in tmstmp order but do it as an
>> index-only scan, so it could omit the extra trip to the table. That trip to
>> the table must be awfully slow to explain the numbers you show later in the
>> thread.
>
>
> Just to clarify, do you mean building indexes like:
> CREATE INDEX "a_tmstmp_parent_id_id_idx_[PART_KEY]" on
> "a_partition[PART_KEY]" USING btree("tmstmp", "parent_id", "id")
> That seems promising! Is the intuition here that we want the first key of
> the index to be the one we are ultimately ordering by? Sounds like I may
> have had that flipped initially. My understanding of this whole situation
> (and please do correct me if this doesn't make sense) is the big bottleneck
> here is reading pages from disk (when looking at stopped up queries, the
> wait_event is DataFileRead), and so anything that can be done to minimize
> the pages read will be valuable. Which is why I would love to get the query
> plan to use the tmstmp index without having to filter thereafter by
> parent_id.
>

Yes, that is the index.

You really want it to filter by parent_id in the index, rather than going
to the table to do the filter on parent_id.  The index pages with tmstmp as
the leading column are going to be more tightly packed with potentially
relevant rows, while the table pages are less likely to be densely packed.
So filtering in the index should touch far fewer pages overall.

Re: Performance difference in accessing differrent columns in a Postgres Table

2018-07-30 Thread Jeff Janes
On Mon, Jul 30, 2018 at 12:11 AM, Pavel Stehule 
wrote:

> 2018-07-30 1:00 GMT+02:00 Tom Lane :
>
>> David Rowley  writes:
>> > On 29 July 2018 at 17:38, Dinesh Kumar  wrote:
>> >> I found performance variance between accessing int1 and int200 column
>> which
>> >> is quite large.
>>
>> > Have a look at slot_deform_tuple and heap_deform_tuple. You'll see
>> > that tuples are deformed starting at the first attribute. If you ask
>> > for attribute 200 then it must deform 1-199 first.
>>
>> Note that that can be optimized away in some cases, though evidently
>> not the one the OP is testing.  From memory, you need a tuple that
>> contains no nulls, and all the columns to the left of the target
>> column have to be fixed-width datatypes.  Otherwise, the offset to
>> the target column is uncertain, and we have to search for it.
>>
>
> JIT decrease a overhead of this.
>

The bottleneck here is such a simple construct, I don't see how JIT could
improve it by much.

And indeed, in my hands JIT makes it almost 3 times worse.

Run against ab87b8fedce3fa77ca0d6, I get 12669.619 ms for the 2nd JIT
execution and 4594.994 ms for the JIT=off.

Cheers,

Jeff
drop table if exists i200c200;
create table i200c200 ( pk bigint primary key, 
int1 bigint,
int2 bigint,
int3 bigint,
int4 bigint,
int5 bigint,
int6 bigint,
int7 bigint,
int8 bigint,
int9 bigint,
int10 bigint,
int11 bigint,
int12 bigint,
int13 bigint,
int14 bigint,
int15 bigint,
int16 bigint,
int17 bigint,
int18 bigint,
int19 bigint,
int20 bigint,
int21 bigint,
int22 bigint,
int23 bigint,
int24 bigint,
int25 bigint,
int26 bigint,
int27 bigint,
int28 bigint,
int29 bigint,
int30 bigint,
int31 bigint,
int32 bigint,
int33 bigint,
int34 bigint,
int35 bigint,
int36 bigint,
int37 bigint,
int38 bigint,
int39 bigint,
int40 bigint,
int41 bigint,
int42 bigint,
int43 bigint,
int44 bigint,
int45 bigint,
int46 bigint,
int47 bigint,
int48 bigint,
int49 bigint,
int50 bigint,
int51 bigint,
int52 bigint,
int53 bigint,
int54 bigint,
int55 bigint,
int56 bigint,
int57 bigint,
int58 bigint,
int59 bigint,
int60 bigint,
int61 bigint,
int62 bigint,
int63 bigint,
int64 bigint,
int65 bigint,
int66 bigint,
int67 bigint,
int68 bigint,
int69 bigint,
int70 bigint,
int71 bigint,
int72 bigint,
int73 bigint,
int74 bigint,
int75 bigint,
int76 bigint,
int77 bigint,
int78 bigint,
int79 bigint,
int80 bigint,
int81 bigint,
int82 bigint,
int83 bigint,
int84 bigint,
int85 bigint,
int86 bigint,
int87 bigint,
int88 bigint,
int89 bigint,
int90 bigint,
int91 bigint,
int92 bigint,
int93 bigint,
int94 bigint,
int95 bigint,
int96 bigint,
int97 bigint,
int98 bigint,
int99 bigint,
int100 bigint,
int101 bigint,
int102 bigint,
int103 bigint,
int104 bigint,
int105 bigint,
int106 bigint,
int107 bigint,
int108 bigint,
int109 bigint,
int110 bigint,
int111 bigint,
int112 bigint,
int113 bigint,
int114 bigint,
int115 bigint,
int116 bigint,
int117 bigint,
int118 bigint,
int119 bigint,
int120 bigint,
int121 bigint,
int122 bigint,
int123 bigint,
int124 bigint,
int125 bigint,
int126 bigint,
int127 bigint,
int128 bigint,
int129 bigint,
int130 bigint,
int131 bigint,
int132 bigint,
int133 bigint,
int134 bigint,
int135 bigint,
int136 bigint,
int137 bigint,
int138 bigint,
int139 bigint,
int140 bigint,
int141 bigint,
int142 bigint,
int143 bigint,
int144 bigint,
int145 bigint,
int146 bigint,
int147 bigint,
int148 bigint,
int149 bigint,
int150 bigint,
int151 bigint,
int152 bigint,
int153 bigint,
int154 bigint,
int155 bigint,
int156 bigint,
int157 bigint,
int158 bigint,
int159 bigint,
int160 bigint,
int161 bigint,
int162 bigint,
int163 bigint,
int164 bigint,
int165 bigint,
int166 bigint,
int167 bigint,
int168 bigint,
int169 bigint,
int170 bigint,
int171 bigint,
int172 bigint,
int173 bigint,
int174 bigint,
int175 bigint,
int176 bigint,
int177 bigint,
int178 bigint,
int179 bigint,
int180 bigint,
int181 bigint,
int182 bigint,
int183 bigint,
int184 bigint,
int185 bigint,
int186 bigint,
int187 bigint,
int188 bigint,
int189 bigint,
int190 bigint,
int191 bigint,
int192 bigint,
int193 bigint,
int194 bigint,
int195 bigint,
int196 bigint,
int197 bigint,
int198 bigint,
int199 bigint,
int200 bigint,
char1 varchar(255),
char2 varchar(255),
char3 varchar(255),
char4 varchar(255),
char5 varchar(255),
char6 varchar(255),
char7 varchar(255),
char8 varchar(255),
char9 varchar(255),
char10 varchar(255),
char11 varchar(255),
char12 varchar(255),
char13 varchar(255),
char14 varchar(255),
char15 varchar(255),
char16 varchar(255),
char17 varchar(255),
char18 varchar(255),
char19 varchar(255),
char20 varchar(255),
char21 varchar(255),
char22 varchar(255),
char23 varchar(255),
char24 varchar(255),
char25 varchar(255),
char26 varchar(255),
char27 varchar(255),
char28 varchar(255),
char29 varchar(255),
char30 varchar(255),
char31 varchar(255),
char32 varchar(255),
char33 varchar(255),
char34 varchar(255),
char35 varchar(255),
char36 varchar(255),
char37 varchar(255),
char38 varchar(255),
char39 varchar(255),
char40 varchar(255),
char41 varchar(255),
char42 varchar(

Re: Performance difference in accessing differrent columns in a Postgres Table

2018-07-30 Thread Jeff Janes
On Mon, Jul 30, 2018 at 12:01 PM, Pavel Stehule 
wrote:

>
>
> 2018-07-30 13:19 GMT+02:00 Jeff Janes :
>
>> On Mon, Jul 30, 2018 at 12:11 AM, Pavel Stehule 
>> wrote:
>>
>>> 2018-07-30 1:00 GMT+02:00 Tom Lane :
>>>
>>>> David Rowley  writes:
>>>> > On 29 July 2018 at 17:38, Dinesh Kumar  wrote:
>>>> >> I found performance variance between accessing int1 and int200
>>>> column which
>>>> >> is quite large.
>>>>
>>>> > Have a look at slot_deform_tuple and heap_deform_tuple. You'll see
>>>> > that tuples are deformed starting at the first attribute. If you ask
>>>> > for attribute 200 then it must deform 1-199 first.
>>>>
>>>> Note that that can be optimized away in some cases, though evidently
>>>> not the one the OP is testing.  From memory, you need a tuple that
>>>> contains no nulls, and all the columns to the left of the target
>>>> column have to be fixed-width datatypes.  Otherwise, the offset to
>>>> the target column is uncertain, and we have to search for it.
>>>>
>>>
>>> JIT decrease a overhead of this.
>>>
>>
>> The bottleneck here is such a simple construct, I don't see how JIT could
>> improve it by much.
>>
>> And indeed, in my hands JIT makes it almost 3 times worse.
>>
>> Run against ab87b8fedce3fa77ca0d6, I get 12669.619 ms for the 2nd JIT
>> execution and 4594.994 ms for the JIT=off.
>>
>
> look on
> http://www.postgresql-archive.org/PATCH-LLVM-tuple-deforming-improvements-td6029385.html
> thread, please.
>
>
The opt1 patch did get performance back to "at least do no harm" territory,
but it didn't improve over JIT=off.  Adding the other two didn't get any
further improvement.

I don't know where the time is going with the as-committed JIT.  None of
the JIT-specific timings reported by EXPLAIN (ANALYZE) add up to anything
close to the slow-down I'm seeing.  Shouldn't compiling and optimization
time show up there?

Cheers,

Jeff


Re: Performance difference in accessing different columns in a Postgres Table

2018-07-31 Thread Jeff Janes
On Mon, Jul 30, 2018 at 1:23 PM, Andres Freund  wrote:

> On 2018-07-30 07:19:07 -0400, Jeff Janes wrote:
>
> > And indeed, in my hands JIT makes it almost 3 times worse.
>
> Not in my measurement. Your example won't use JIT at all, because it's
> below the cost threshold. So I think you might just be seeing cache +
> hint bit effects?
>

No, it is definitely JIT.  The explain plans show it, and the cost of the
query is 230,000 while the default setting of jit_above_cost is 100,000.
It is fully reproducible by repeatedly toggling the JIT setting.  It
doesn't seem to be the cost of compiling the code that slows it down (I'm
assuming the code is compiled once per tuple descriptor, not once per
tuple), but rather the efficiency of the compiled code.
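
A minimal sketch of that toggle test, assuming the 200-column table
i200c200 used elsewhere in this thread already exists:

SHOW jit_above_cost;                                -- 100000 by default
EXPLAIN (ANALYZE) SELECT pk, int200 FROM i200c200;  -- plan shows a JIT section
SET jit = off;
EXPLAIN (ANALYZE) SELECT pk, int200 FROM i200c200;  -- JIT section is gone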



>
> > Run against ab87b8fedce3fa77ca0d6, I get 12669.619 ms for the 2nd JIT
> > execution and 4594.994 ms for the JIT=off.
>
> Even with a debug LLVM build, which greatly increases compilation
> overhead, I actually see quite the benefit when I force JIT to be used:
>

I don't see a change when I compile without --enable-debug,
and jit_debugging_support is off (or, in 11beta2, nonexistent).  How can I
know if I have a debug LLVM build, and turn it off if I do?


>
>
> postgres[26832][1]=# ;SET jit_above_cost = -1; set jit_optimize_above_cost
> = 0; set jit_inline_above_cost = 0;
> postgres[26832][1]=# explain (analyze, buffers, timing off) select pk,
> int200 from i200c200;
>

Lowering jit_optimize_above_cost does redeem this for me.  It brings it
back to being a tie with JIT=OFF.  I don't see any further improvement by
lowering jit_inline_above_cost, and overall it is just a statistical tie
with JIT=off, not an improvement as you get, but at least it isn't a
substantial loss.

Under what conditions would I want to do jit without doing optimizations on
it?  Is there a rule of thumb that could be documented, or do we just use
the experimental method for each query?

I don't know how sensitive JIT is to hardware.  I'm using Ubuntu 16.04 on
VirtualBox (running on Windows 10) on an i5-7200U, which might be important.

I had previously done a poor-man's JIT where I created 4 versions of the
main 'for' loop in slot_deform_tuple.  I did a branch on "if(hasnulls)",
and then each branch had two loops, one for when 'slow' is false, and then
one for after 'slow' becomes true so we don't have to keep setting it true
again once it already is, in a tight loop.  I didn't see noticeable
improvement there (although perhaps I would have on different hardware), so
didn't see how JIT could help with this almost-entirely-null case.  I'm not
trying to address JIT in general, just as it applies to this particular
case.

Unrelated to JIT and relevant to the 'select pk, int199' case but not the
'select pk, int200' case, it seems we have gone to some length to make slot
deforming be efficient for incremental use, but then just deform in bulk
anyway up to maximum attnum used in the query, at least in this case.  Is
that because incremental deforming is not cache efficient?

Cheers,

Jeff


Re: Performance difference in accessing different columns in a Postgres Table

2018-08-01 Thread Jeff Janes
On Mon, Jul 30, 2018 at 3:02 PM, Andres Freund  wrote:

> Hi,
>
> On 2018-07-30 13:31:33 -0400, Jeff Janes wrote:
> > I don't know where the time is going with the as-committed JIT.  None of
> > the JIT-specific timings reported by EXPLAIN (ANALYZE) add up to anything
> > close to the slow-down I'm seeing.  Shouldn't compiling and optimization
> > time show up there?
>
> As my timings showed, I don't see the slowdown you're reporting. Could
> you post a few EXPLAIN ANALYZEs?
>


I don't think you showed any timings where jit_above_cost < query cost <
jit_optimize_above_cost, which is where I saw the slow down.  (That is also
where things naturally land for me using default settings)

I've repeated my test case on a default build (./configure --with-llvm
--prefix=) and default postgresql.conf, using the post-11BETA2 commit
5a71d3e.


I've attached the full test case, and the full output.

Here are the last two executions, with jit=on and jit=off, respectively.
Doing it with TIMING OFF doesn't meaningfully change things, nor does
increasing shared_buffers beyond the default.



QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on i200c200  (cost=0.00..22.28 rows=828 width=16) (actual
time=29.317..11966.291 rows=1000 loops=1)
 Planning Time: 0.034 ms
 JIT:
   Functions: 2
   Generation Time: 1.589 ms
   Inlining: false
   Inlining Time: 0.000 ms
   Optimization: false
   Optimization Time: 9.002 ms
   Emission Time: 19.948 ms
 Execution Time: 12375.493 ms
(11 rows)

Time: 12376.281 ms (00:12.376)
SET
Time: 1.955 ms
   QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on i200c200  (cost=0.00..22.28 rows=828 width=16) (actual
time=0.063..3897.302 rows=1000 loops=1)
 Planning Time: 0.037 ms
 Execution Time: 4292.400 ms
(3 rows)

Time: 4293.196 ms (00:04.293)

Cheers,

Jeff


wide.sql
Description: Binary data


wide.out
Description: Binary data


Re: Bi-modal streaming replication throughput

2018-08-14 Thread Jeff Janes
On Tue, Aug 14, 2018 at 9:18 AM, Alexis Lê-Quôc  wrote:

>
> each running PG 9.3 on linux


That is the oldest version which is still supported.  There have been a lot
of improvements since then, including to performance.  You should see if an
upgrade solves the problem.  If not, at least you will have access to
better tools (like pg_stat_activity.wait_event_type), and people will be
more enthusiastic about helping you figure it out knowing it is not an
already-solved problem.
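
On 9.6 or later, something like this samples what the active sessions are
waiting on:

SELECT wait_event_type, wait_event, count(*)
  FROM pg_stat_activity
 WHERE state = 'active'
 GROUP BY 1, 2
 ORDER BY count(*) DESC;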


>
> Here are some settings that may help and a perf profile of a recovery
> process that runs without any competing read traffic processing the INSERT
> backlog (I don't unfortunately have the same profile on a lagging read
> replica).
>

Unfortunately the perf profile from when the problem is not occurring won't be very
helpful.  You need it from when the problem is occurring.  Also, I find
strace and gdb to more helpful than perf in this type of situation where
you already know it is not CPU bound, although perhaps that is just my own
lack of skill with perf. You need to know why it is not on the CPU, not
what it is doing when it is on the CPU.

Were the settings you showed all of the non-default settings?

I assume max_standby_streaming_delay is at the default value of 30s?  Are
you getting query cancellations due to conflicts with recovery, or anything
else suspicious in the log?  What is the maximum lag you see measured in
seconds?
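
For example, run on the standby, this estimates replay lag (it overstates
the lag when the primary is idle):

SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;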

Cheers,

Jeff


Re: trying to delete most of the table by range of date col

2018-09-03 Thread Jeff Janes
>
> 4)delete in chunks :
> do $$
> declare
> rec integer;
> begin
> select count(*) into rec from my_table where end_date <=
> to_date('12/12/2018','DD/MM/YYYY') and end_date >
> to_date('11/12/2018','DD/MM/YYYY');
> while rec > 0 loop
> DELETE FROM my_table WHERE id IN (select id from my_table where end_date <=
> to_date('12/12/2018','DD/MM/YYYY') and end_date >
> to_date('11/12/2018','DD/MM/YYYY') limit 5000);
> rec := rec - 5000;
> raise notice '5000 records were deleted, current rows :%',rec;
> end loop;
>
> end;
> $$
> ;
>
> Execution time : 6 minutes.
>
> So, it seems that the second solution is the fastest one. Is there a
> reason why the delete in chunks (solution 4) wasn't faster?
>

Why would it be faster?  The same amount of work needs to get done, no
matter how you slice it.  Unless there is a specific reason to think it
would be faster, I would expect it won't be.

If you aren't willing to drop the constraints, then I think you just need
to resign yourself to paying the price of checking those constraints. Maybe
some future version of PostgreSQL will be able to do them in parallel.
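
A sketch of the drop-and-recreate route, with made-up constraint and column
names; re-adding the constraint validates it in one pass rather than row by
row:

BEGIN;
ALTER TABLE child_table DROP CONSTRAINT child_table_my_table_id_fkey;
DELETE FROM my_table
 WHERE end_date <= to_date('12/12/2018','DD/MM/YYYY')
   AND end_date >  to_date('11/12/2018','DD/MM/YYYY');
ALTER TABLE child_table ADD CONSTRAINT child_table_my_table_id_fkey
    FOREIGN KEY (my_table_id) REFERENCES my_table (id);
COMMIT;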

Cheers,

Jeff


Re: Query is slow when run for first time; subsequent execution is fast

2018-09-04 Thread Jeff Janes
On Tue, Sep 4, 2018 at 3:16 AM jimmy  wrote:

> On windows, how to put an entry in my db startup script to run this query
> (pg_prewarm) immediately after startng the server, and let the query warm
> the cache itself.
>

Starting with PostgreSQL version 11 (to be released soon), you can use
 pg_prewarm.autoprewarm.

Until then, maybe this:
https://superuser.com/questions/502160/run-a-scheduled-task-after-a-windows-service-is-started

I've tested neither one.
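
A sketch of both options; the table name is a placeholder:

-- manual warming, any version with the extension installed:
CREATE EXTENSION pg_prewarm;
SELECT pg_prewarm('my_table');

-- automatic re-warming after restart, v11+, in postgresql.conf:
shared_preload_libraries = 'pg_prewarm'
pg_prewarm.autoprewarm = on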

Cheers,

Jeff


Re: Performance difference in accessing different columns in a Postgres Table

2018-09-05 Thread Jeff Janes
On Wed, Sep 5, 2018 at 12:21 AM Dinesh Kumar  wrote:

> Hi All,
> I was wondering whether the case is solved or still continuing. As a
> Postgres newbie, I can't understand any of the terms (JIT, tuple
> deformation) as you mentioned above. Please anyone let me know , what is
> the current scenario.
>
>
JIT is a just-in-time compilation, which will be new in v11.  Tuple
deforming is how you get the row from the on-disk format to the in-memory
format.

Some people see small improvements in tuple deforming using JIT in your
situation, some see large decreases, depending on settings and apparently
on hardware.  But regardless, JIT is not going to reduce your particular
use case (many nullable and actually null columns, referencing a
high-numbered column) down to being constant-time operation in the number
of preceding columns.  Maybe JIT will reduce the penalty for accessing a
high-numbered column by 30%, but won't reduce the penalty by 30 fold.  Put
your NOT NULL columns first and then most frequently accessed NULLable
columns right after them, if you can.

Cheers,

Jeff

>


Re: Performance difference in accessing different columns in a Postgres Table

2018-09-05 Thread Jeff Janes
On Wed, Sep 5, 2018 at 12:00 PM Jeff Janes  wrote:

> On Wed, Sep 5, 2018 at 12:21 AM Dinesh Kumar  wrote:
>
>> Hi All,
>> I was wondering whether the case is solved or still continuing. As a
>> Postgres newbie, I can't understand any of the terms (JIT, tuple
>> deformation) as you mentioned above. Please anyone let me know , what is
>> the current scenario.
>>
>>
> JIT is a just-in-time compilation, which will be new in v11.  Tuple
> deforming is how you get the row from the on-disk format to the in-memory
> format.
>
> Some people see small improvements in tuple deforming using JIT in your
> situation, some see large decreases, depending on settings and apparently
> on hardware.  But regardless, JIT is not going to reduce your particular
> use case (many nullable and actually null columns, referencing a
> high-numbered column) down to being constant-time operation in the number
> of preceding columns.  Maybe JIT will reduce the penalty for accessing a
> high-numbered column by 30%, but won't reduce the penalty by 30 fold.  Put
> your NOT NULL columns first and then most frequently accessed NULLable
> columns right after them, if you can.
>

Correction: NOT NULL columns with fixed width types first.  Then of the
columns which are either nullable or variable width types, put the most
frequently accessed earlier.
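
A hedged illustration of that ordering, with made-up names:

CREATE TABLE t (
    id      bigint      NOT NULL,  -- fixed-width NOT NULL columns first
    created timestamptz NOT NULL,
    hot_col bigint,                -- frequently read nullable columns next
    notes   text                   -- variable-width, rarely read columns last
);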


Re: query gets very slow when :jsonb ?& operator is used

2018-09-06 Thread Jeff Janes
On Thu, Sep 6, 2018 at 7:52 PM  wrote:

> I have also asked this question on Stackoverflow and DBA stack exchange
> with no answer. It's a fairly long post, so I will post a link to it, as on
> Stackoverflow it is formatted nicely
>
>
> https://stackoverflow.com/questions/52212878/query-gets-very-slow-when-jsonb-operator-is-used
>
> Any idea why my query slows down so much when I add 
> account.residence_details::jsonb
> ?& array['city', 'state', 'streetName'] ?
>

The planner has no insight into what fraction of rows will satisfy the ?&
condition, and falls back on the assumption that very few will.  This is
(apparently) a very bad assumption, and causes it to choose a bad plan.

Rewriting the `phone_number.account_id IN (subquery)` into an exists query
might help.
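
A sketch of that rewrite; the join column names are guessed from the post:

SELECT pn.*
  FROM phone_number pn
 WHERE EXISTS (
     SELECT 1
       FROM account a
      WHERE a.id = pn.account_id
        AND a.residence_details::jsonb ?& array['city', 'state', 'streetName']
 );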

Cheers,

Jeff


Re: Multi-second pauses blocking even trivial activity

2018-09-07 Thread Jeff Janes
On Fri, Sep 7, 2018 at 8:00 AM Patrick Molgaard  wrote:

> Hi folks,
>
> I've been seeing some curious behaviour on a postgres server I administer.
>
> Intermittently (one or two times a week), all queries on that host are
> simultaneously blocked for extended periods (10s of seconds).
>
> The blocked queries are trivial & not related to locking - I'm seeing
> slowlogs of the form:
>
> `LOG: duration: 22627.299 ms statement: SET client_encoding='''utf-8''';`
>
>
Do you have log_lock_waits set to on?  If not, you might want to turn it on.
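
For example (this setting only needs a reload, not a restart):

ALTER SYSTEM SET log_lock_waits = on;
SELECT pg_reload_conf();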

Cheers,

Jeff


Re: Multi-second pauses blocking even trivial activity

2018-09-07 Thread Jeff Janes
On Fri, Sep 7, 2018 at 2:03 PM Patrick Molgaard  wrote:

>
> Hi Jeff,
>
> Thanks for your reply. Are locks relevant in this case, though?
>

I don't know, but why theorize when we can know for sure?  It at least
invokes VirtualXactLockTableInsert.  I don't see how that could block on a
heavyweight lock, though. But again, why theorize when logging it is simple?

Is it always the first statement in a connection which is blocking, or will
established connections also block at the same time the new ones start to
block?

Cheers,

Jeff

>


Re: To keep indexes in memory, is large enough effective_cache_size enough?

2018-09-19 Thread Jeff Janes
On Wed, Sep 19, 2018 at 5:19 AM Sam R.  wrote:

> Hi!
>
> Is is possible to force PostgreSQL to keep an index in memory?
>

It might be possible to put the indexes in a separate tablespace, then do
something at the file-system level to force the OS cache to keep pages
for that FS in memory.
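
A sketch of the tablespace half, with made-up names; the pinning itself
would need an external tool such as vmtouch:

CREATE TABLESPACE index_space LOCATION '/mnt/pinned_fs';
ALTER INDEX my_hash_index SET TABLESPACE index_space;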



> The data in db table columns is not needed to be kept in memory, only the
> index. (hash index.)
>

This sounds like speculation.  Do you have hard evidence that this is
actually the case?


>
> It would sound optimal in our scenario.
> I think Oracle has capability to keep index in memory (in-memory db
> functionality). But does PostgreSQL have such a functionality? (I keep
> searching.)
>

There are a lot of Oracle capabilities which encourage people to
micromanage the server in ways that are almost never actually productive.

> Should I actually set shared_buffers to tens of gigabytes also, if I want
> to keep one very big index in memory?
>

If your entire database fits in RAM, then it could be useful to set
shared_buffers high enough to fit the entire database.

If fitting the entire database in RAM is hopeless, 10s of gigabytes is
probably too much, unless you have 100s of GB of RAM. PostgreSQL doesn't do
direct IO, but rather uses the OS file cache extensively.  This leads to
double-buffering, where a page is read from disk and stored in the OS file
cache, then handed over to PostgreSQL where it is also stored in
shared_buffers. That means that 1/2 of RAM is often the worst value for
shared_buffers.  You would want it to be either something like 1/20 to 1/10
of RAM, or something like 9/10 or 19/20 of RAM, so that you concentrate
pages into one of the caches or the other.  The low fraction of RAM is the
more generally useful option.  The high fraction of RAM is useful when you
have very high write loads, particularly intensive index updating--and in
that case you probably need someone to intensively monitor and baby-sit the
database.

Cheers,

Jeff


Re: Gained %20 performance after disabling bitmapscan

2018-10-19 Thread Jeff Janes
On Fri, Oct 19, 2018 at 3:19 AM Yavuz Selim Sertoglu <
yavuzselim.serto...@medyasoft.com.tr> wrote:

> Hi all,
>
> I have a problem with my query. Query always using parallel bitmap heap
> scan. I've created an index with all where conditions and id but query does
> not this index and continue to use bitmapscan. So I decided disable bitmap
> scan for testing. And after that, things became strange. Cost is higher,
> execution time is lower.
>

A 20% difference in speed is unlikely to make or break you.  Is it even
worth worrying about?


> But I want to use index_only_scan because index have all column that query
> need. No need to access table.
>

Your table is not very well vacuumed, so there is need to access it (9010
times to get 6115 rows, which seems like quite an anti-feat; but I don't
know which of those numbers are averaged over loops/parallel workers,
versus summed over them). Vacuuming your table will not only make the
index-only scan look faster to the planner, but also actually be faster.

The difference in timing could easily be down to one query warming the
cache for the other.  Are these timings fully reproducible when altering
execution orders back and forth?  They also have different degrees of
parallelism; what happens if you disable parallelism to simplify the
analysis?
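
For example:

SET max_parallel_workers_per_gather = 0;
EXPLAIN (ANALYZE, BUFFERS) SELECT ...;  -- re-run the query here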


> It is doing index_only_scan when disabling bitmap scan but I cannot
> disable bitmap scan for cluster wide. There are other queries...
> Can you help me to solve the issue?
>
>
Cranking up effective_cache_size can make index scans look better in
comparison to bitmap scans, without changing a lot of other stuff.  This
still holds even for index-only-scan, in cases where the planner knows the
table to be poorly vacuumed.

But moving the column tested for inequality to the end of the index would
probably make much more of a difference, regardless of which plan it
chooses.
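
A sketch, with made-up column names: equality-tested columns first, the
inequality-tested column last:

CREATE INDEX ON my_table (eq_col1, eq_col2, range_col);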

Cheers,

Jeff

>


Re: Gained %20 performance after disabling bitmapscan

2018-10-26 Thread Jeff Janes
On Mon, Oct 22, 2018 at 3:20 AM Yavuz Selim Sertoglu <
yavuzselim.serto...@medyasoft.com.tr> wrote:

> Thanks for the reply Jeff,
>
> I know 20ms is nothing but it shows me that there is a problem with my
> configuration. I want to find it.
>

This is a dangerous assumption.  There is no configuration you can come up
with which will cause the planner to be within 20% of perfection in all
cases.  Given the other plans you've shown and discussed, I think this is
just chasing our own tail.

Cheers,

Jeff

>


Re: High CPU Usage of "SET ROLE"

2018-10-30 Thread Jeff Janes
On Tue, Oct 30, 2018 at 3:50 PM Ulf Lohbrügge 
wrote:


> When I use the psql cli on the same database I can see via "\timing" that
> the first statement after "RESET ROLE;" is significantly slower. I was even
> able to strip it down to two statements ("SET ROLE ...;" and "RESET ROLE;"):
>
> ...
>
Maybe my observations here are already sufficient to find out what happens
> here? I guess that my setup with 1k rows in pg_roles and 1.5m rows in
> pg_class is probably the cause.
>

It would probably be enough if it were reproducible, but I can't reproduce
it.

-- set up
perl -le 'print "create user foo$_;" foreach 1..1000'|psql
perl -le 'foreach $r (1..1000) {print "create schema foo$r authorization
foo$r;"}'|psql
perl -le 'foreach $r (reverse 1..1000) {print "set role foo$r;"; print
"create table foo$r.foo$_ (x serial primary key);" foreach 1..1000;}'|psql
> out

-- test
perl -le 'print "set role foo$_;\nreset role;" foreach 1..1000'|psql

Does it help when I create a test setup with a docker image that contains a
> database with that many entries in pg_roles and pg_class and share it here?
>

If you have a script to create the database, I'd be more likely to play
around with that than with a docker image.  (Which I have to guess would be
> quite large anyway, with 1.5m rows in pg_class)

Cheers,

Jeff

>


Re: Optimizer choosing the wrong plan

2018-11-26 Thread Jeff Janes
On Mon, Nov 26, 2018 at 5:11 AM Viswanath  wrote:

> *Postgres server version -  9.5.10*
> *RAM - 128 GB*
> *WorkMem 64 MB*
>
> *Problematic query with explain :*
> *Query 1 (original):*
> explain analyse SELECT myTable1.ID FROM myTable1 LEFT JOIN myTable2 ON
myTable1.ID=myTable2.ID WHERE ((((myTable1.bool_val = true) AND
> (myTable1.small_intval IN (1,2,3))) AND ((*myTable2.bigint_val = 1*) AND
> (myTable1.bool_val = true))) AND (((myTable1.ID >= 1) AND
> (myTable1.ID <= 1)) ))  ORDER BY 1 DESC , 1 NULLS FIRST  LIMIT
> 11;
>

There is no point doing a LEFT JOIN when the NULL-extended rows get
filtered out later.

Also, ordering by the same column twice is peculiar, to say the least.


> The table myTable2 contains *12701952* entries. Out of which only *86227*
> is
> not null and *146* entries are distinct.
>

I assume you mean the column myTable2.ID has that many not null and
distinct?


>
> The above query returns 0 rows since 'myTable2.bigint_val = 1' criteria
> satisfies nothing. It takes 6 seconds for execution as the planner chooses*
> myTable1.ID column's index*.


More importantly, it chooses the index on myTable2.ID.  It does also use
the index on myTable1.ID, but that is secondary.

The ideal index for this query would probably be a composite index on
myTable2 (bigint_val, id DESC);
The planner would probably choose to use that index, even if the statistics
are off.

I tried running *vacuum analyse* table many times, tried changing the
> *statistics target of the column to 250 (since there are only 149 distinct
> values)*. But none worked out. The planner thinks that there are *1727*
> rows
> that matches the condition *myTable2.bigint_val = 1* but there are none.
>

It would interesting if you can upgrade a copy of your server to v11 and
try it there.  We made changes to ANALYZE in that version which were
intended to improve this situation, and it would be nice to know if it
actually did so for your case.

Also, increasing statistics target even beyond 250 might help.  If every
one of the observed values is seen at least twice, it will trigger the
system to assume that it has observed all distinct values that exist.  But
if any of the values are seen exactly once, that causes it to take a
different path (the one which got modified in v11).
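
For example, raising the target on just that column:

ALTER TABLE myTable2 ALTER COLUMN bigint_val SET STATISTICS 1000;
ANALYZE myTable2;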

Cheers,

Jeff


Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0

2018-11-27 Thread Jeff Janes
On Tue, Nov 27, 2018 at 9:17 AM Sanyo Moura  wrote:

> Hi,
>
> I'm running performance tests for my application at version 11.1 and
> encountered
> queries with high planning time compared to the same planning, running at
> versions 10.5 and 11.0.
>

Can you reproduce the regression if the tables are empty?  If so, can you
share the create script that creates the tables?

Cheers,

Jeff

>


Re: SQL Perfomance during autovacuum

2018-12-21 Thread Jeff Janes
On Wed, Dec 19, 2018 at 1:04 AM anand086  wrote:

>
> The Execution time for the above sql is  17841.467 ms during normal
> operations but when autovacuum runs on table test_table, the same sql took
> 1628495.850 ms (from the postgres log).
>
> We have noticed this increase in execution times for the sqls only when
> autovacuum runs and it runs with prevent wraparound mode.


Some competition for resource is to be expected with autovacuum, but making
a one-hundred fold difference in run time is rather extreme. I'd suggest
that what you have is a locking issue.  Something is trying to take a brief
Access Exclusive lock on the table.  It blocks on the lock held by the
autovacuum, and then the Access Share lock needed for your query blocks
behind that.

Normally an autovacuum will yield the lock when it notices it is blocking
something else, but will not do so for wraparound.

If you have log_lock_waits turned on, you should see some evidence in the
log file if this is the case.
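
On 9.6 or later you can also catch the blocking chain in the act with
something like:

SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, query
  FROM pg_stat_activity
 WHERE cardinality(pg_blocking_pids(pid)) > 0;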

Cheers,

Jeff


Re: PostgreSQL Read IOPS limit per connection

2018-12-27 Thread Jeff Janes
>
>
> *Performance issue:*
>
> I’m trying to figure out if PostgreSQL (PG) has some inherent limit on
> IOPS per connection.
>
> Running pgbench with multiple clients (-c 30) we are able to see 20K+ IOPS
> , which is what we expect. But, if we use just one client, we get 1200
> IOPS, avg disk queue size around 1:
>

The default transaction done by pgbench simply has no opportunity for
dispatching multiple IO requests per connection.  It is just a series of
single-row lookups and single-row updates or inserts.  You will have to use
a different benchmark if you want to exercise this area.  Probably
something analytics heavy.

Also, you would want to use the newest version of PostgreSQL, as 9.6
doesn't have parallel query, which is much more generally applicable than
effective_io_concurrency is.

> One of the issues I’m trying to solve is related to extracting data from a
> large table, which uses a full table scan. We see the same 1200 IOPS limit
> of pgbench when we SELECT on this table using just one connection. If there
> is a limitation per connection, I might set up the application to have
> several connections, and then issue SELECTs for different sections of the
> table, and later join the data, but it looks cumbersome, especially if the
> DB can do extract data using more IOPS.
>
The kernel should detect a sequential read in progress and invoke
readahead.  That should be able to keep the CPU quite busy with data for
any decent IO system.  Are you sure IO is even the bottleneck for your
query?

Perhaps your kernel readahead settings need to be tuned.  Also, you may
benefit from parallel query features implemented in newer versions of
PostgreSQL.  In any event, the default transactions of pgbench are not
going to be useful for benchmarking what you care about.

Cheers,

Jeff


Re: Optimizer choosing the wrong plan

2018-12-29 Thread Jeff Janes
On Sat, Dec 29, 2018 at 7:17 AM Jim Finnerty  wrote:


> Jeff, can you describe the changes that were made to ANALYZE in v11,
> please?
>
> I've found that running ANALYZE on v10 on the Join Order Benchmark, using
> the default statistics target of 100, produces quite unstable results, so
> I'd be interested to hear what has been improved in v11.
>

There are two paths the code can take.  One if all values which were
sampled at all were sampled at least twice, and another if the
least-sampled value was sampled exactly once.  For some distributions (like
exponential-ish or maybe power-law), it is basically a coin flip whether
the least-sampled value is seen once, or more than once.  If you are seeing
instability, it is probably for this reason.  That fundamental instability
was not addressed in v11.

Once you follow the "something seen exactly once" path, it has to decide
how many of the values get represented in the most-common-value list.  That
is where the change was.  The old method said a value had to have an
estimated prevalence at least 25% more than the average estimated
prevalence to get accepted into the list.  The problem is that if there
were a few dominant values, it wouldn't be possible for any others to be
"over-represented" because those few dominant values dragged the average
prevalence up so far nothing else could qualify.  What it was changed to
was to include a value in the most-common-value list if its
overrepresentation was statistically significant given the sample size.
The most significant change (from my perspective) is that
over-representation is measured not against all values, but only against
all values more rare in the sample then the one currently being considered
for inclusion into the MCV.  The old method basically said "all rare values
are the same", while the new method realizes that a rare value present
10,000 times in a billion row table is much different than a rare value
present 10 time in a billion row table.

It is possible that this change will fix the instability for you, because
it could cause the "seen exactly once" path to generate a MCV list which is
close enough in size to the "seen at least twice" path that you won't
notice the difference between them anymore.  But, it is also possible they
will still be different enough in size that it will still appear unstable.
It depends on your distribution of values.
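
You can see which shape the statistics took by inspecting them directly;
table and column names are placeholders:

SELECT n_distinct, most_common_vals, most_common_freqs
  FROM pg_stats
 WHERE tablename = 'mytable' AND attname = 'mycol';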

Cheers,

Jeff


Re: Query Performance Issue

2018-12-29 Thread Jeff Janes
On Sat, Dec 29, 2018 at 1:58 AM David Rowley 
wrote:

> On Sat, 29 Dec 2018 at 04:32, Justin Pryzby  wrote:
> > I think the solution is to upgrade (at least) to PG10 and CREATE
> STATISTICS
> > (dependencies).
>
> Unfortunately, I don't think that'll help this situation. Extended
> statistics are currently only handled for base quals, not join quals.
> See dependency_is_compatible_clause().
>
>
But "recommended_content_id" and "version" are both in the same table,
doesn't that make them base quals?

The most obvious thing to me would be to vacuum
product_content_recommendation_main2 to get rid of the massive number of
heap fetches.  And to analyze everything to make sure the estimation errors
are not simply due to out-of-date stats.  And to increase work_mem.

It isn't clear we want to get rid of the nested loop, from the info we have
to go on the hash join might be even slower yet.  Seeing the plan with
enable_nestloop=off could help there.

Cheers,

Jeff


Re: postgresql unix socket connections

2019-01-09 Thread Jeff Janes
On Wed, Jan 9, 2019 at 3:35 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

>
> Now, In machine 1 when I run psql I get the prompt password but in machine
> 2 I keep getting the next error :
>
> psql: could not connect to server: No such file or directory
> Is the server running locally and accepting
> connections on Unix domain socket
> "/var/run/postgresql/.s.PGSQL.5432"?
>
> One important thing that I didnt mention, is that I installed in machine 2
> package postgresql-libs.x86_64 0:8.4.20-8.el6_9 from the postgres
> repository (in order to upgrade it to 9.6).
>

The front end and the backend have compiled-in defaults for the socket
directory.  If you installed them from different sources, they may have
different compiled-in defaults.  Which means they may not be able to
rendezvous using the default settings for both of them.

You can override the default using unix_socket_directory on the server (as
you discovered).  On the client you can override it by using -h (or PGHOST
or host= or whatever mechanism), with an argument that looks like a
directory, starting with a '/'.
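
For example, either of these points the client at a specific socket
directory:

psql -h /var/run/postgresql -p 5432 mydb
psql "host=/var/run/postgresql dbname=mydb"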

Cheers,

Jeff


Re: postgresql unix socket connections

2019-01-09 Thread Jeff Janes
On Wed, Jan 9, 2019 at 10:09 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

> Hey Tom,
> I'm aware of how I can solve it. I wanted to understand why after
> installing the pg 9.6 packages suddenly psql tries to access the socket on
> /var/run/postgresql. Does the libpq default unix socket is changed between
> those two versions ? (9.6,9.2)
>

It is not a version issue, but a packaging issue.  Different systems have
different conventions on where sockets should go, and the packager imposes
their opinion on the things they package.

Cheers,

Jeff


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-27 Thread Jeff Janes
>
>
> You could also try pg_test_fsync to get low-level information, to
>> supplement the high level you get from pgbench.
>
>
> Thanks for pointing me to this tool. never knew pg_test_fsync existed!
> I've run `pg_test_fsync -s 60` two times and this is the output -
> https://gist.github.com/saurabhnanda/b60e8cf69032b570c5b554eb50df64f8 I'm
> not sure what to make of it?
>

I don't know what to make of that either.  I'd expect fdatasync using two
8kB writes to be about the same throughput as using one 8kB write, but
instead it is 4 times slower.  Also, I'd expect open_datasync to get slower
by a factor of 2, not a factor of 8, when going from one to two 8kB writes
(that is not directly relevant, as you aren't using open_datasync, but is
curious nonetheless).  Is this reproducible with different run lengths?  I
wonder if your write cache (or something) gets "tired" during the first
part of pg_test_fsync and thus degrades the subsequent parts of the test.
I would say something in your IO stack is not optimal, maybe some component
is "consumer grade" rather than "server grade".  Maybe you can ask Hetzner
about that.


> The effects of max_wal_size are going to depend on how you have IO
>> configured, for example does pg_wal shared the same devices and controllers
>> as the base data?  It is mostly about controlling disk usage and
>> crash-recovery performance, neither of which is of primary importance to
>> pgbench performance.
>
>
>  The WAL and the data-directory reside on the same SSD disk -- is this a
> bad idea?
>

If you are trying to squeeze out every last bit of performance, then I
think it is a bad idea.  Or at least, it is worth trying the alternative and
see.  The flushing that occurs during checkpoints and the flushing that
occurs for every commit can interfere with each other.


> I was under the impression that smaller values for max_wal_size cause
> pg-server to do "maintenance work" related to wal rotation, etc. more
> frequently and would lead to lower pgbench performance.
>

If you choose ridiculously small values it would.  But once the value is
sufficient, increasing it further wouldn't do much.  Given your low level
of throughput, I would think the default is already sufficient.

Thanks for including the storage info.  Nothing about it stands out to me
as either good or bad, but I'm not a hardware maven; hopefully one will be
reading along and speak up.


> PS: Cc-ing the list back again because I assume you didn't intend for your
> reply to be private, right?
>

Yes, I had intended to include the list but hit the wrong button, sorry.

Cheers,

Jeff

>


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-27 Thread Jeff Janes
On Sun, Jan 27, 2019 at 2:39 AM Saurabh Nanda 
wrote:

>
>> PGOPTIONS="-c synchronous_commit=off" pgbench -T 3600 -P 10 
>>
>>
>> I am currently running all my benchmarks with synchronous_commit=off and
>> will get back with my findings.
>>
>
>
> It seems that PGOPTIONS="-c synchronous_commit=off" has a significant
> impact.
>

It is usually not acceptable to run applications with
synchronous_commit=off, so once you have identified that the bottleneck is
in implementing synchronous_commit=on, you probably need to take a deep
dive into your hardware to figure out why it isn't performing the way you
need/want/expect it to.  Tuning the server under synchronous_commit=off
when you don't intend to run your production server with that setting is
unlikely to be fruitful.


> However, I still can not understand why the TPS for the optimised case is
> LOWER than the default for higher concurrency levels!

In case you do intend to run with synchronous_commit=off, or if you are
just curious:  running with a very high number of active connections often
reveals subtle bottlenecks and interactions, and is very dependent on your
hardware.  Unless you actually intend to run your server with
synchronous_commit=off and with a large number of active connections, it is
probably not worth investigating this.  You can make a hobby of it, of
course, but it is a time consuming hobby to have.  If you do want to, I
think you should start out with your optimized settings and revert them one
at a time to find the one the caused the performance regression.

I'm more interested in the low end, you should do much better than those
reported numbers when clients=1 and synchronous_commit=off with the data on
SSD.  I think you said that pgbench is running on a different machine than
the database, so perhaps it is just network overhead that is keeping this
value down.  What happens if you run them on the same machine?

> > +--------+---------------------+------------------------+
> > | client | Mostly defaults [1] | Optimised settings [2] |
> > +--------+---------------------+------------------------+
> > | 1      | 80-86               | 169-180                |
> > +--------+---------------------+------------------------+
>

Cheers,

Jeff


Re: Benchmarking: How to identify bottleneck (limiting factor) and achieve "linear scalability"?

2019-01-29 Thread Jeff Janes
On Mon, Jan 28, 2019 at 12:03 AM Saurabh Nanda 
wrote:

> All this benchmarking has led me to a philosophical question, why does PG
> need shared_buffers in the first place?
>

PostgreSQL cannot let the OS get its hands on a dirty shared buffer until
the WAL record "protecting" that buffer has been flushed to disk. If a
dirty shared buffer got written to disk, but then a crash happened before
the WAL record go flushed to disk, then the data could be corrupted when it
comes back up. So shared_buffers effectively serves as a cooling pond where
dirty buffers wait for their WAL to be flushed naturally so they can be
written without instigating a performance-reducing flush just for them.

Also, concurrent clients need to access the same disk pages at overlapping
times without corrupting each other.  Perhaps that could be implemented to
have just the buffer headers in shared memory to coordinate the locking,
and not having the buffers themselves in shared memory.  But that is not
how it is currently implemented.


> What's wrong with letting the OS do the caching/buffering?
>

Nothing, and that is what it does.  Which is why the advice for
shared_buffers is often to use a small fraction of RAM, leaving the rest
for the OS to do its thing.  But PostgreSQL still needs a way to lock those
pages, both against concurrent access by its own clients, and against
getting flushed out of order by the OS.  There is no performant way to
release the dirty pages immediately to the OS while still constraining the
order in which the OS flushes them to disk.

Finally, while reading a page from the OS cache into shared_buffers is much
faster than reading it from disk, it is still much slower than finding it
already located in shared_buffers.  So if your entire database fits in RAM,
you will get better performance if shared_buffers is large enough for the
entire thing to fit in there, as well.  This is an exception to the rule
that shared_buffers should be a small fraction of RAM.


> Isn't it optimised for this kind of stuff?
>

Maybe.  But you might be surprised at how poorly optimized it is.  It depends
on your OS and version of it, of course.  If you have a high usage_count
buffer which is re-dirtied constantly, it will only get written and flushed
to disk once per checkpoint if under PostgreSQL control. But I've seen
pages like that get written many times per second under kernel control.
Whatever optimization it tried to do, it wasn't very good at.  Also, if
many contiguous pages are dirtied in a close time-frame, but not dirtied in
their physical order, the kernel should be able to re-order them into long
sequential writes, correct?  But empirically, it doesn't, at least back in
the late 2.* series kernels when I did the experiments.  I don't know if it
didn't even try, or tried but failed.  (Of course back then, PostgreSQL
didn't do a good job of it either)

Cheers,

Jeff


Re: Will higher shared_buffers improve tpcb-like benchmarks?

2019-01-29 Thread Jeff Janes
On Tue, Jan 29, 2019 at 6:39 AM Saurabh Nanda 
wrote:

> Hi,
>
> I'm going crazy trying to optimise my Postgres config for a production
> setting [1] Once I realised random changes weren't getting my anywhere, I
> finally purchased PostgreSQL 10 - Higher Performance [2] and understood the
> impact of shared_buffers.
>
> IIUC, shared_buffers won't have any significant impact in the following
> scenario, right?
>
> -- DB size = 30GB
> -- shared_buffers = 2GB
> -- workload = tpcb-like
>
> This is because the tpcb-like workload selects & updates random rows from
> the DB [3]. Therefore, with a 2GB shared buffer, there is only a 6-7%
> chance (did I get my probability correct?) that the required data will be
> in the shared_buffer. Did I understand this correctly?
>

That is likely correct, but the data will likely be stored in the OS file
cache, so reading it from there will still be pretty fast.


>
> If nothing else becomes the bottleneck (eg. periodically writing dirty
> pages to disk), increasing the shared_buffers to 15GB+ should have a
> significant impact, for this DB-size and workload, right? (The system has
> 64 GB RAM)
>

About the only way to know for sure that writing dirty data is not the
bottleneck is to use a read only benchmark, such as the -S flag for
pgbench.  And at that point, the IPC overhead between pgbench and the
backend, even when both are running on the same machine, is likely to be
the bottleneck.  And after that, the bottleneck might shift to opening and
closing transactions and taking and releasing locks[1].
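
For example, a read-only run might look like this (the database name is a
placeholder):

pgbench -S -c 8 -j 8 -T 300 bench_db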

If you overcome that, then you might reliably see a difference between 2GB
and 15GB of shared buffers, because at 2GB each query to pgbench_accounts
is likely to fetch 2 pages into shared_buffers from the OS cache: the index
leaf page for pgbench_accounts_pkey, and the table page for
pgbench_accounts.  At 15GB, the entire index should be reliably in
shared_buffers (after enough warm-up time), so you would only need to fetch
1 page, and often not even that.

Cheers,

Jeff

[1]   I have a very old patch to pgbench that introduces a new query to
overcome this,
https://www.postgresql.org/message-id/BANLkTi%3DQBYOM%2Bzj%3DReQeiEKDyVpKUtHm6Q%40mail.gmail.com
.  I don't know how much work it would be to get it to compile against
newer versions--I stopped maintaining it because it became too much work to
rebase it past conflicting work, and because I lost interest in this line
of research.


Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread Jeff Janes
On Wed, Feb 6, 2019 at 5:29 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:


> Now the question is how to handle or tune it? Is there any chance that I
> need to increase the cost_limit / cost_delay?
>

Sometimes vacuum has more work to do, so it takes more time to do it.

There is no indication of a problem. Or at least, you haven't described
one. So, there is nothing to handle or to tune.

If there is a problem, those log entries might help identify it.  But in
the absence of a problem, they are just log spam.

Cheers,

Jeff


Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread Jeff Janes
On Wed, Feb 6, 2019 at 9:42 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

> Well, basically I'm trying to tune it because the table still keeps
> growing. I thought that by setting the scale and the threshold it will be
> enough but its seems that it wasnt. I attached some of the logs output to
> hear what you guys think about it ..
>

Are all four log entries from well after you made the change?  My first
inclination is to think that the first 2 are from either before the change,
or just after the change when it is still settling into the new regime.
Also, is the table still continuing to grow, or is at a new steady-state of
bloat which isn't growing but also isn't shrinking back to where you want
it to be?  More aggressive vacuuming alone should stop the bloat, but is
not likely to reverse it.

I habitually set vacuum_cost_page_hit and vacuum_cost_page_miss to zero.
Page reads are self-limiting (vacuum is single threaded, so you can't have
more than one read (times autovacuum_max_workers) going on at a time) so I
don't see a need to throttle them intentionally as well--unless your entire
db is sitting on one spindle.  Based on the high ratio of read rates to
write rates in the last two log entries, this change alone should be enough
greatly speed up the run time of the vacuum.
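
For example (these apply to autovacuum as well as manual VACUUM, and only
need a reload):

ALTER SYSTEM SET vacuum_cost_page_hit = 0;
ALTER SYSTEM SET vacuum_cost_page_miss = 0;
SELECT pg_reload_conf();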

If you need to speed it up beyond that, I don't think it matters much
whether you decrease cost_delay or increase cost_limit, it is the ratio
that mostly matters.

And if these latter measures do work, you should consider undoing changes
to autovacuum_vacuum_scale_factor.  Reading the entire index just to remove
10,000 rows from the table is a lot of extra work that might be
unnecessary. Although that extra work might not be on anyone's critical
path.

>


Re: autovacuum big table taking hours and sometimes seconds

2019-02-07 Thread Jeff Janes
On Thu, Feb 7, 2019 at 6:55 AM Mariel Cherkassky <
mariel.cherkas...@gmail.com> wrote:

I have 3 questions :
> 1)To what value do you recommend to increase the vacuum cost_limit ? 2000
> seems reasonable ? Or maybe its better to leave it as default and assign a
> specific value for big tables ?
>

That depends on your IO hardware, and your workload.  You wouldn't want
background vacuum to use so much of your available IO that it starves your
other processes.



> 2)When the autovacuum reaches the cost_limit while trying to vacuum a
> specific table, it wait nap_time seconds and then it continue to work on
> the same table ?
>

No, it waits for autovacuum_vacuum_cost_delay before resuming within the
same table. During this delay, the table is still open and it still holds a
lock on it, and holds the transaction open, etc.  Naptime is entirely
different, it controls how often the vacuum scheduler checks to see which
tables need to be vacuumed again.



> 3)So in case I have a table that keeps growing (not fast because I set the
> vacuum_scale_factor to 0 and the autovacuum_vacuum_threshold to 1). If
> the table keep growing it means I should try to increase the cost right ?
> Do you see any other option ?
>

 You can use pg_freespacemap to see if the free space is spread evenly
throughout the table, or clustered together.  That might help figure out
what is going on.  And, is it the table itself that is growing, or the
index on it?
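
For example; the table name is a placeholder:

CREATE EXTENSION pg_freespacemap;
SELECT blkno, avail
  FROM pg_freespace('my_table')
 WHERE avail > 0
 ORDER BY blkno
 LIMIT 20;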

Cheers,

Jeff


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 10:42 AM Tom Lane  wrote:

> Thomas Kellerer  writes:
> > The bloom index is only used if either Seq Scan is disabled or if the
> random_page_cost is set to 1 (anything above 1 triggers a Seq Scan on my
> Windows laptop).
>
> Hm.  blcostestimate is using the default cost calculation, except for
>
> /* We have to visit all index tuples anyway */
> costs.numIndexTuples = index->tuples;
>
> which essentially tells genericcostestimate to assume that every index
> tuple will be visited.  This obviously is going to increase the cost
> estimate; maybe there's something wrong with that?
>

I assumed (without investigating yet) that genericcostestimate is applying
a cpu_operator_cost (or a few of them) on each index tuple, while the
premise of a bloom index is that you do very fast bit-fiddling, not more
expense SQL operators, for each tuple and then do the recheck only on what
survives to the table tuple part.

Cheers,

Jeff


Re: Performance regressions found using sqlfuzz

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 4:23 AM Jung, Jinho  wrote:

>
> Hello,
>
> We are developing a tool called sqlfuzz for automatically finding
> performance regressions in PostgreSQL. sqlfuzz performs mutational fuzzing
> to generate SQL queries that take more time to execute on the latest
> version of PostgreSQL compared to prior versions. We hope that these
> queries would help further increase the utility of the regression test
> suite.
>
> We would greatly appreciate feedback from the community regarding the
> queries found by the tool so far. We have already incorporated prior
> feedback from the community in the latest version of sqlfuzz.
>

This approach doesn't seem very exciting to me as-is, because optimization
is a very pragmatic endeavor.  We make decisions all the time that might
make some queries better and others worse.  If the queries that get better
are natural/common ones, and the ones that get worse are weird/uncommon
ones (like generated by a fuzzer), then making that change is an
improvement even if there are some performance (as opposed to correctness)
regressions.

I would be more interested in investigating some of these if the report
would:

1) include the exact commit in which the regression was introduced (i.e.
automate "git bisect").
2) verify that the regression still exists in the dev HEAD and report which
commit it was verified in (since HEAD changes frequently).
3) report which queries (if any) in your corpus were made better by the
same commit which made the victim query worse.

Cheers,

Jeff

>


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 11:58 AM Jeff Janes  wrote:

>
> On Tue, Feb 12, 2019 at 10:42 AM Tom Lane  wrote:
>
>>
>> Hm.  blcostestimate is using the default cost calculation, except for
>>
>> /* We have to visit all index tuples anyway */
>> costs.numIndexTuples = index->tuples;
>>
>> which essentially tells genericcostestimate to assume that every index
>> tuple will be visited.  This obviously is going to increase the cost
>> estimate; maybe there's something wrong with that?
>>
>
> I assumed (without investigating yet) that genericcostestimate is applying
> a cpu_operator_cost (or a few of them) on each index tuple, while the
> premise of a bloom index is that you do very fast bit-fiddling, not more
> expensive SQL operators, for each tuple and then do the recheck only on what
> survives to the table tuple part.
>

In order for bloom (or any other users of CREATE ACCESS METHOD, if there
are any) to have a fighting chance to do better, I think many of selfuncs.c
currently private functions would have to be declared in some header file,
perhaps utils/selfuncs.h.  But that then requires a cascade of other
inclusions.  Perhaps that is why it was not done.

Cheers,

Jeff

>


Re: Bloom index cost model seems to be wrong

2019-02-12 Thread Jeff Janes
On Tue, Feb 12, 2019 at 4:17 PM Tom Lane  wrote:

> Jeff Janes  writes:
> > In order for bloom (or any other users of CREATE ACCESS METHOD, if there
> > are any) to have a fighting chance to do better, I think many of
> selfuncs.c
> > currently private functions would have to be declared in some header
> file,
> > perhaps utils/selfuncs.h.  But that then requires a cascade of other
> > inclusions.  Perhaps that is why it was not done.
>
> I'm just in the midst of refactoring that stuff, so if you have
> suggestions, let's hear 'em.
>

The goal would be that I can copy the entire definition of
genericcostestimate into blcost.c, change the function's name, and get it
to compile.  I don't know the correct way accomplish that.  Maybe
utils/selfuncs.h can be expanded to work, or if there should be a new
header file like "utils/index_am_cost.h"

What I've done for now is:

#include "../../src/backend/utils/adt/selfuncs.c"

which I assume is not acceptable as a real solution.


It's possible that a good cost model for bloom is so far outside
> genericcostestimate's ideas that trying to use it is not a good
> idea anyway.
>

I think that might be the case.  I don't know what the right answer would
look like, but I think it will likely end up needing to access everything
that genericcostestimate currently needs to access.  Or if bloom doesn't,
some other extension implementing an ACCESS METHOD will.

Cheers,

Jeff


Re: Why isn't an index scan being used?

2019-02-20 Thread Jeff Janes
On Tue, Feb 19, 2019 at 11:59 PM Abi Noda  wrote:

> Thanks Justin.
>
> The 4ms difference in the examples isn't an accurate benchmark. I'm seeing
> about a ~20% difference over a larger sample size. And this is on a fork of
> the production database.
>

Please show the execution plans from that larger sample, if that is the one
that is most relevant.

You can "set enable_bitmapscan = off" to get rid of the bitmap scan in
order to see the estimated cost and actual performance of the next-best
plan (which will probably be the regular index scan).

Cheers,

Jeff


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-24 Thread Jeff Janes
On Sat, Feb 23, 2019 at 4:06 PM Gunther  wrote:

> the dequeue operation is essentially this:
>
> BEGIN
>
> SELECT jobId, action
>   FROM Queue
>   WHERE pending
>   FOR UPDATE SKIP LOCKED
>
>
There is no LIMIT shown.  Wouldn't the first thread to start up just lock
all the rows and everyone else would starve?
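
A sketch of the usual formulation, so each worker grabs only one job at a
time:

SELECT jobId, action
  FROM Queue
 WHERE pending
 ORDER BY jobId
 LIMIT 1
   FOR UPDATE SKIP LOCKED;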

Cheers,

Jeff


Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-24 Thread Jeff Janes
On Sun, Feb 24, 2019 at 1:02 PM Gunther  wrote:

> Thank you all for responding so far.
>
> David Rowley  and Justin Pryzby suggested things about autovacuum. But I
> don't think autovacuum has any helpful role here. I am explicitly doing a
> vacuum on that table. And it doesn't help at all. Almost not at all.
>
If you do a vacuum verbose, what stats do you get back?  What is the size
of the index when the degradation starts to show, and immediately after a
successful reindex?

Also, how is JobID assigned?  Are they from a sequence, or some other
method where they are always added to the end of the index's keyspace?

When it starts to degrade, what is the EXPLAIN plan for the query?

Cheers,

Jeff

