Re: GPG signing

2020-05-26 Thread Tim Cross


Marc Munro  writes:

> I need to be able to cryptographically sign objects in my database
> using a public key scheme.
>
> Is my only option to install plpython or some such thing?   Python
> generally makes me unhappy as I'm never convinced of its stability or
> the quality of its APIs, and it is not obvious to me which of the many
> different gpg-ish packages I should choose.
>
> Any other options?  Am I missing something?
>

This is something you would normally implement at the application level,
using the database as just the store for the data and signatures or
signed digests.

The main reason for this is to allow for secure key management. It is
very difficult to implement a secure key management solution at the
database level unless it is designed into the fundamental architecture
of the RDBMS. It is the same age-old problem - how can you encrypt data
AND have the keys for the encrypted data in the same place? The main
reason for encryption is so that if your store gets compromised, the
data cannot be read. However, if your key is also in the store, then
when your store is compromised, your key is compromised too and your
encryption becomes a moot issue.

If, on the other hand, you handle the encryption/signing at the application
level, you can separate your key store and data store so that compromise of
one doesn't also compromise the other. This of course does create other
issues - most notably that you now have an additional mandatory layer
between you and your data (for example, you can still use psql to query
your database, but all you can see is encrypted objects).

In your case, this may not be as big an issue because you state you want
to sign rather than encrypt. You could, for example, design your
application so that the data is in one column and the signature is in
the other (or use json or other 'object' types that allow
attributes/properties). This would allow simple querying of the data and
verification of data integrity to be performed as separate operations.
All you then need to ensure is that every time data is modified, a new
signature is generated.
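
As a rough illustration of that layout (table and column names here are
invented for the example, not taken from any real schema):

CREATE TABLE signed_document (
    doc_id     SERIAL PRIMARY KEY,
    payload    JSONB NOT NULL,       -- the data you query normally
    signature  BYTEA NOT NULL,       -- detached signature produced by the application
    signed_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- Plain querying needs no keys at all:
SELECT payload->>'customer' FROM signed_document WHERE doc_id = 1;

-- Verification is the separate operation: the application reads payload
-- and signature and checks them against the public key outside the db.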

I would also verify you really do need full cryptographic signing rather
than just some less rigid integrity verification, like a basic checksum
hash. Crypto signing is most useful when you want to verify both the
integrity of something and its source, i.e. this data has not been
changed and was signed by X. In this case, each source is
encrypted/signed with a different private/secret key. If on the other
hand you just want to know that the data has not been modified, you can
generate a checksum/hash of the data when it is inserted/updated and
store that in a separate column. This data may or may not be encrypted
depending on your use case. In this situation, you only need one key -
the key used to encrypt the column - or no keys if you don't actually
need to encrypt it.
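
For the checksum-only variant, pgcrypto's digest() function is enough; a
rough, untested sketch along those lines (reusing the illustrative table
from above):

CREATE EXTENSION IF NOT EXISTS pgcrypto;

ALTER TABLE signed_document ADD COLUMN payload_sha256 BYTEA;

CREATE OR REPLACE FUNCTION set_payload_checksum() RETURNS trigger AS $$
BEGIN
    -- recompute the hash on every insert/update so it always matches the data
    NEW.payload_sha256 := digest(NEW.payload::text, 'sha256');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER payload_checksum
BEFORE INSERT OR UPDATE ON signed_document
FOR EACH ROW EXECUTE FUNCTION set_payload_checksum();  -- EXECUTE PROCEDURE on PG < 11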

While a non-encrypted checksum is not as secure, not all applications
need that level of security. In some cases, having a separate checksum
is sufficient. If someone wants to modify the data 'unofficially', in
addition to compromising the db and modifying the data, they also have
to generate a new hash of the modified data and store that in the
appropriate column. If you just want to protect against accidental
modification of the data, or want reasonable confidence (for some measure
of reasonable), just having a checksum hash may be sufficient. The big
advantage of the simpler case with no actual data encryption is that
other clients can access/use the data without requiring access to the
encryption key.

-- 
Tim Cross




Re: GPG signing

2020-05-26 Thread raf
Marc Munro wrote:

> On Tue, 2020-05-26 at 12:04 -0700, Adrian Klaver wrote:
> > On 5/26/20 12:01 PM, Marc Munro wrote:
> > > I need to be able to cryptographically sign objects in my database
> > > using a public key scheme.
> > > [ . . . ]
> > > Any other options?  Am I missing something?
> > 
> > https://www.postgresql.org/docs/12/pgcrypto.html#id-1.11.7.34.7
> 
> I looked at that but I must be missing something.  In order to usefully
> sign something, the private (secret) key must be used to encrypt a
> digest of the thing being signed (something of a simplification, but
> that's the gist).  This can then be verified, by anyone, using the
> public key.
> 
> But the pgcrypto functions, for good reasons, do not allow the private
> (secret) key to be used in this way.  Encryption and signing algorithms
> are necessarily different as the secret key must be protected; and we
> don't want signatures to be huge, and it seems that pgcrypto has not
> implemented signing algorithms.
> 
> What am I missing?
> 
> __
> Marc

That page linked to above says:

  F.25.3.10. Limitations of PGP Code

No support for signing.





Re: GPG signing

2020-05-26 Thread Marc Munro
On Tue, 2020-05-26 at 12:04 -0700, Adrian Klaver wrote:
> On 5/26/20 12:01 PM, Marc Munro wrote:
> > I need to be able to cryptographically sign objects in my database
> > using a public key scheme.
> > [ . . . ]
> > Any other options?  Am I missing something?
> 
> https://www.postgresql.org/docs/12/pgcrypto.html#id-1.11.7.34.7

I looked at that but I must be missing something.  In order to usefully
sign something, the private (secret) key must be used to encrypt a
digest of the thing being signed (something of a simplification, but
that's the gist).  This can then be verified, by anyone, using the
public key.

But the pgcrypto functions, for good reasons, do not allow the private
(secret) key to be used in this way.  Encryption and signing algorithms
are necessarily different as the secret key must be protected; and we
don't want signatures to be huge, and it seems that pgcrypto has not
implemented signing algorithms.

What am I missing?

__
Marc




Re: GPG signing

2020-05-26 Thread David G. Johnston
On Tuesday, May 26, 2020, Marc Munro  wrote:

> I need to be able to cryptographically sign objects in my database
> using a public key scheme.
>
> Any other options?  Am I missing something?
>

This feels like it should be an application (middleware...) concern, not
the database proper.  i.e., store previously signed data into a normal text
or bytea field.

That said there is:

 https://www.postgresql.org/docs/12/pgcrypto.html#id-1.11.7.34.7

David J.


Re: GPG signing

2020-05-26 Thread Adrian Klaver

On 5/26/20 12:01 PM, Marc Munro wrote:

I need to be able to cryptographically sign objects in my database
using a public key scheme.

Is my only option to install plpython or some such thing?   Python
generally makes me unhappy as I'm never convinced of its stability or
the quality of its APIs, and it is not obvious to me which of the many
different gpg-ish packages I should choose.

Any other options?  Am I missing something?


https://www.postgresql.org/docs/12/pgcrypto.html#id-1.11.7.34.7



Thanks.

__
Marc






--
Adrian Klaver
adrian.kla...@aklaver.com




GPG signing

2020-05-26 Thread Marc Munro
I need to be able to cryptographically sign objects in my database
using a public key scheme.

Is my only option to install plpython or some such thing?   Python
generally makes me unhappy as I'm never convinced of its stability or
the quality of its APIs, and it is not obvious to me which of the many
different gpg-ish packages I should choose.

Any other options?  Am I missing something?

Thanks.

__
Marc





Re: Advise on how to install pl/perl on existing DB.

2020-05-26 Thread Adrian Klaver

On 5/26/20 9:32 AM, David Gauthier wrote:

psql (9.6.0, server 11.3) linux

Hi:
I'm a PG user who has asked our IT team to install pl/perlu on an 
existing 9.6.0 instance on linux.  They really don't know how to 
approach this.  Could someone point me to a good step-by-step  
(including ptrs to any downloads they may need) ?

Also, when they do this, will it require DB downtime ?


Your subject says pl/perl and the above says pl/perlu. So in addition to 
what Alan said:


Trusted and Untrusted PL/Perl

https://www.postgresql.org/docs/12/plperl-trusted.html



Thanks in Advance !



--
Adrian Klaver
adrian.kla...@aklaver.com




Re: Advise on how to install pl/perl on existing DB.

2020-05-26 Thread Alan Hodgson
On Tue, 2020-05-26 at 12:32 -0400, David Gauthier wrote:
> psql (9.6.0, server 11.3) linux
> 
> Hi:  
> I'm a PG user who has asked our IT team to install pl/perlu on an
> existing 9.6.0 instance on linux.  They really don't know how to
> approach this.  Could someone point me to a good step-by-step 
> (including ptrs to any downloads they may need) ?
>  
> Also, when they do this, will it require DB downtime ? 
>  
> Thanks in Advance !

a) that says your server is running 11.3, not 9.6 (you might want to
update your own machine).

b) That really depends on how it was installed.

If they used a normal package repo they'll just want to install the
postgresql11-plperl package or equivalent, and then restart PostgreSQL
(although I'm not actually sure the restart is required?) and then do a
"CREATE EXTENSION plperlu;" in your database.

If they did some kind of custom installation then they might need to
just figure it out.


Advise on how to install pl/perl on existing DB.

2020-05-26 Thread David Gauthier
psql (9.6.0, server 11.3) linux

Hi:
I'm a PG user who has asked our IT team to install pl/perlu on an existing
9.6.0 instance on linux.  They really don't know how to approach this.
Could someone point me to a good step-by-step  (including ptrs to any
downloads they may need) ?

Also, when they do this, will it require DB downtime ?

Thanks in Advance !


Re: Help understanding SIReadLock growing without bound on completed transaction

2020-05-26 Thread Mike Klaas
On second look, it does seem the xid crossed the 2^32 mark recently, since 
most tables have a frozenxid close to 4b and the current xid is ~50m:

SELECT relname, age(relfrozenxid), relfrozenxid FROM pg_class WHERE relkind = 
'r' and relname not like 'pg%' order by relname;

 relname |    age    | relfrozenxid
---------+-----------+--------------
         | 107232506 |   4237961815
         |  93692362 |   4251501959
         | 183484103 |   4161710218
         |  50760536 |   4294433785
         |  58821410 |   4286372911
         | 117427283 |   4227767038
         |      9454 |   4250653210
…

select max(backend_xid::text), min(backend_xmin::text) from pg_stat_activity 
where state='active';

   max    |   min
----------+----------
 50350294 | 50350065

-Mike

On Tue, May 26, 2020 at 8:42 AM, Mike Klaas < m...@superhuman.com > wrote:

> On Fri, May 22, 2020 at 3:15 PM, Thomas Munro < thomas.mu...@gmail.com > wrote:
>
>> Predicate locks are released by ClearOldPredicateLocks(), which releases
>> SERIALIZABLEXACTs once they are no longer interesting. It has a
>> conservative idea of what is no longer interesting: it waits until the
>> lowest xmin across active serializable snapshots is >= the transaction's
>> finishedBefore xid, which was the system's next xid (an xid that hasn't
>> been used yet*) at the time the SERIALIZABLEXACT committed. One
>> implication of this scheme is that SERIALIZABLEXACTs are cleaned up in
>> commit order. If you somehow got into a state where a few of them were
>> being kept around for a long time, but others committed later were being
>> cleaned up (which I suppose must be the case or your system would be
>> complaining about running out of SERIALIZABLEXACTs), that might imply that
>> there is a rare leak somewhere in this scheme. In the past I have wondered
>> if there might be a problem with wraparound in the xid tracking for
>> finished transactions, but I haven't worked out the details (transaction
>> ID wraparound is both figuratively and literally the Ground Hog Day of
>> PostgreSQL bug surfaces).
>
> Thanks for the detailed reply, Thomas.  Is SERIALIZABLEXACT transaction ID
> wraparound the same as global xid wraparound?  The max transaction age in
> the db is ~197M [1] so I don't think we've gotten close to global
> wraparound lately.
>
> Would it be helpful to cross-post this thread to pgsql-bugs or further
> investigate on my end?
>
> -Mike
>
> [1] superhuman@production => select datname, datfrozenxid,
> age(datfrozenxid) from pg_catalog.pg_database;
>
>     datname    | datfrozenxid |    age
> ---------------+--------------+-----------
>  cloudsqladmin |   4173950091 | 169089900
>  template0     |   4266855294 |  76184697
>  postgres      |   4173951306 | 169088685
>  template1     |   4266855860 |  76184131
>  superhuman    |   4145766807 | 197273184

Re: Help understanding SIReadLock growing without bound on completed transaction

2020-05-26 Thread Mike Klaas
On Fri, May 22, 2020 at 3:15 PM, Thomas Munro < thomas.mu...@gmail.com > wrote:

> 
> 
> 
> Predicate locks are released by ClearOldPredicateLocks(), which releases
> SERIALIZABLEXACTs once they are no longer interesting. It has a
> conservative idea of what is no longer interesting: it waits until the
> lowest xmin across active serializable snapshots is >= the transaction's
> finishedBefore xid, which was the system's next xid (an xid that hasn't
> been used yet*) at the time the SERIALIZABLEXACT committed. One
> implication of this scheme is that SERIALIZABLEXACTs are cleaned up in
> commit order. If you somehow got into a state where a few of them were
> being kept around for a long time, but others committed later were being
> cleaned up (which I suppose must be the case or your system would be
> complaining about running out of SERIALIZABLEXACTs), that might imply that
> there is a rare leak somewhere in this scheme. In the past I have wondered
> if there might be a problem with wraparound in the xid tracking for
> finished transactions, but I haven't worked out the details (transaction
> ID wraparound is both figuratively and literally the Ground Hog Day of
> PostgreSQL bug surfaces).
> 
> 
> 
> 

Thanks for the detailed reply, Thomas.  Is SERIALIZABLEXACT transaction ID 
wraparound the same as global xid wraparound?  The max transaction age in the 
db is ~197M [1] so I don't think we've gotten close to global wraparound lately.

Would it be helpful to cross-post this thread to pgsql-bugs or further
investigate on my end?

-Mike

[1] superhuman@production => select datname, datfrozenxid, age(datfrozenxid)
from pg_catalog.pg_database;

    datname    | datfrozenxid |    age
---------------+--------------+-----------
 cloudsqladmin |   4173950091 | 169089900
 template0     |   4266855294 |  76184697
 postgres      |   4173951306 | 169088685
 template1     |   4266855860 |  76184131
 superhuman    |   4145766807 | 197273184

Re: pg_basebackup + incremental base backups

2020-05-26 Thread Stephen Frost
Greetings,

* Christopher Pereira (krip...@imatronix.cl) wrote:
> On 24-May-20 15:48, Stephen Frost wrote:
> >That really shouldn't be possible.  I'm very curious as to exactly what
> >happened that resulted in your primary/replica being 'out of sync', as
> >you say.
> 
> Actually this was more a hypothetical question to find a solution in case
> some day one of our standby clusters goes out of sync and we have to rebuild
> it having a very big database.

"out of sync" is a bit of an odd concept, but having a replica fall
behind a long way is certainly something that can happen and may require
a rebuild from a backup (or from a new sync off of the primary in some
other way, as you suggest below).  In a situation where there's async
replication happening and you promote a replica to take over, that's
definitely a case where you might also have to rebuild the former
primary.

> With proper WAL archiving this shouldn't happen but we wanted to be prepared
> for this scenario just in case.

Sure, having WAL archiving and good backups is definitely my strong
recommendation for how to address any cases where the replica falls
behind, or the replica is promoted and you want to remaster the former
primary.

> We did some tests measuring IO and traffic and are very happy with the
> results. We will definitely be adding pgBackRest to our toolchain.

Glad to hear that.

> Regarding my initial question, I still believe that the world deserves a
> simple direct pg_basebackup replacement even when putting an additional
> "repo host" in the middle is a better idea in the long term.

Perhaps..

> As you said, all the pieces are there and it would be quite easy to write a
> new "pg_basebackup_delta" script that could be executed on the standby host
> to:
> 
> 1) setup a pgBackRest repo on the primary host (via SSH)
> 
> 2) create a backup on the primary host (via SSH)
> 
> 3) do a delta restore on the standby
> 
> Even when the repository on the primary host is only created temporarily
> (and require double storage, resources, etc), it may still be worth
> considering the traffic that can be saved by doing a delta restore on a
> standby host in a different region, right?

So...  There's actually a way to do this with pgbackrest, but it doesn't
support the delta capability.

We take care in pgbackrest to make the repo format for full backups
actually match exactly what a PG cluster would look like, specifically
because we wish to allow users to, if absolutely everything else fails
and pgbackrest is non-functional, rebuild from the repo.  To that point,
we even make sure that command-line tools like gzip, bzip2, openssl,
etc, work with the files we create.

For what you're asking about though, you would do something like:

- Set up a pgbackrest repo on the host you're rebuilding (the replica)
- Disable all repo compression, encryption, et al.
- Enable archive copy, so the WAL for the backup is put into the backup
- Enable parallel jobs
- Set up pgbackrest on the primary with the replica configured as the
  repo host, get WAL archiving working and such (a minimal sketch of
  that step follows this list).
- Run stanza-create
- Perform a *full* backup, make sure everything works.
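
For the WAL archiving step in that list, the SQL-visible part might look
roughly like this (the stanza name 'main' is illustrative, and the
pgbackrest.conf side of the setup is not shown):

-- run on the primary once pgbackrest is configured there
ALTER SYSTEM SET archive_mode = 'on';   -- needs a PostgreSQL restart to take effect
ALTER SYSTEM SET archive_command = 'pgbackrest --stanza=main archive-push %p';
SELECT pg_reload_conf();                -- archive_command is picked up on reload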

Once that's done, you go find the pg_data directory inside the full
backup that you made inside of the pgbackrest repo, copy/move/rsync
(with checksums enabled!) that to where your PG data directory should
be, set up a recovery.conf to point to the primary, maybe also set it up
with a restore_command pointing to that repo (may or may not be needed,
depending) and start up PG.  That should allow PG to start, replay all
of the necessary WAL, and then connect to the primary and start
streaming.

If you have any tablespaces, you'd need to deal with those too, of
course.  If you start moving things out of the pgbackrest repo, you're
corrupting it, naturally, so you wouldn't really want to continue using
it once you've gotten all of this done.

Again, this isn't an approach which I'd generally recommend...  We used
to do it when we were first writing pgbackrest with a ZFS filesystem and
after a backup we'd create a snapshot off of the repo and start PG right
up and then run pg_dump on it and do other checks to make sure it
worked, but have moved to using pgbackrest delta restores instead for
that kind of use-case.

Thanks,

Stephen




Re: Slow SELECT

2020-05-26 Thread Frank Millman




On 2020-05-26 1:45 PM, David Rowley wrote:

On Tue, 26 May 2020 at 22:31, Frank Millman  wrote:


Thank you David. I tried that and it produced the correct result in
53ms, which is what I am looking for.

It will take me some time to understand it fully, so I have some
homework to do!


The main problem with your previous query was that the subquery was
being executed 11088 times but could only ever find anything 167
times; the rest of the executions found nothing.

I just changed the subquery which would be executed once per output
row and altered it so it became a subquery that's joined and only
executed once.  The ROW_NUMBER() is a windowing function, which is
explained in [1].  I used this to get the row_id of the record with
the lowest tran_date, just like you were doing with the ORDER BY
tran_date DESC LIMIT 1, but the subquery with the windowing function
gets them all at once, rather than doing it in a way that requires it
to be executed once for each row in the top-level query. In this case,
the functionality that the LIMIT 1 does in your query is achieved with
"AND a.row_num = 1;" in my version. This is pretty fast to execute
once due to there only being 167 rows.

It's also important to know that there may be cases where the method I
proposed is slower.  For example, if my_table was very large and
contained rows that were not in table_1 to table_4. Since the subquery
in my version calculates everything then it could be wasteful to do
that for values that would never be used. For you, you have foreign
keys that ensure my_table does not contain records that are not in the
other tables, but you could still see this issue if you were to add
some restrictive WHERE clause to the outer query.  Perhaps this won't
be a problem for you, but it's likely good to know.

[1] https://www.postgresql.org/docs/current/tutorial-window.html



Thanks very much for the explanation. I will go through it carefully.

For the record, your query works without modification in both Sql Server 
and sqlite3. It is also much faster in all three cases - all around 
0.005 sec instead of 0.05 sec.


Frank




Re: Slow SELECT

2020-05-26 Thread David Rowley
On Tue, 26 May 2020 at 23:41, Vik Fearing  wrote:
>
> On 5/26/20 12:04 PM, David Rowley wrote:
> > Since "my_table" is small, you'd likely be much better doing a manual
> > rewrite of the query to join a subquery containing the required
> > details from "my_table".  It looks like you want the row_id from the
> > latest tran_date for each fld_N column. So something like:
> >
> > SELECT a.row_id
> > FROM table_1 b
> > CROSS JOIN table_2 c
> > CROSS JOIN table_3 d
> > CROSS JOIN table_4 e
> > LEFT OUTER JOIN (
> > SELECT fld_1,fld_2,fld_3,fld_4,row_id,tran_date,
> > ROW_NUMBER() OVER (PARTITION BY fld_1,fld_2,fld_3,fld_4 ORDER BY
> > tran_date DESC) row_num
> > FROM my_table
> > WHERE deleted_id = 0
> > ) a ON a.fld_1 = b.row_id AND a.fld_2 = c.row_id AND a.fld_3 =
> > d.row_id AND a.fld_4 = e.row_id AND a.row_num = 1;
> >
> > Should do it. You could also perhaps do something with DISTINCT ON
> > instead of using ROW_NUMBER(). That might be a bit more efficient, but
> > it's unlikely to matter too much since there are only 167 rows in that
> > table.
>
> I would expect a lateral query to be better here.

But that would put it back to executing the subquery 11088 times. I
wrote it in a way to avoid that.

David




Re: Slow SELECT

2020-05-26 Thread David Rowley
On Tue, 26 May 2020 at 22:31, Frank Millman  wrote:

> Thank you David. I tried that and it produced the correct result in
> 53ms, which is what I am looking for.
>
> It will take me some time to understand it fully, so I have some
> homework to do!

The main problem with your previous query was that the subquery was
being executed 11088 times but could only ever find anything 167
times; the rest of the executions found nothing.

I just changed the subquery which would be executed once per output
row and altered it so it became a subquery that's joined and only
executed once.  The ROW_NUMBER() is a windowing function, which is
explained in [1].  I used this to get the row_id of the record with
the lowest tran_date, just like you were doing with the ORDER BY
tran_date DESC LIMIT 1, but the subquery with the windowing function
gets them all at once, rather than doing it in a way that requires it
to be executed once for each row in the top-level query. In this case,
the functionality that the LIMIT 1 does in your query is achieved with
"AND a.row_num = 1;" in my version. This is pretty fast to execute
once due to there only being 167 rows.

It's also important to know that there may be cases where the method I
proposed is slower.  For example, if my_table was very large and
contained rows that were not in table_1 to table_4. Since the subquery
in my version calculates everything then it could be wasteful to do
that for values that would never be used. For you, you have foreign
keys that ensure my_table does not contain records that are not in the
other tables, but you could still see this issue if you were to add
some restrictive WHERE clause to the outer query.  Perhaps this won't
be a problem for you, but it's likely good to know.

[1] https://www.postgresql.org/docs/current/tutorial-window.html

David




Re: Slow SELECT

2020-05-26 Thread Vik Fearing
On 5/26/20 12:04 PM, David Rowley wrote:
> Since "my_table" is small, you'd likely be much better doing a manual
> rewrite of the query to join a subquery containing the required
> details from "my_table".  It looks like you want the row_id from the
> latest tran_date for each fld_N column. So something like:
> 
> SELECT a.row_id
> FROM table_1 b
> CROSS JOIN table_2 c
> CROSS JOIN table_3 d
> CROSS JOIN table_4 e
> LEFT OUTER JOIN (
> SELECT fld_1,fld_2,fld_3,fld_4,row_id,tran_date,
> ROW_NUMBER() OVER (PARTITION BY fld_1,fld_2,fld_3,fld_4 ORDER BY
> tran_date DESC) row_num
> FROM my_table
> WHERE deleted_id = 0
> ) a ON a.fld_1 = b.row_id AND a.fld_2 = c.row_id AND a.fld_3 =
> d.row_id AND a.fld_4 = e.row_id AND a.row_num = 1;
> 
> Should do it. You could also perhaps do something with DISTINCT ON
> instead of using ROW_NUMBER(). That might be a bit more efficient, but
> it's unlikely to matter too much since there are only 167 rows in that
> table.

I would expect a lateral query to be better here.

SELECT a.*
FROM table_1 AS b
CROSS JOIN table_2 AS c
CROSS JOIN table_3 AS d
CROSS JOIN table_4 AS e
CROSS JOIN LATERAL (
SELECT *
FROM my_table AS a
WHERE (a.fld_1, a.fld_2, a.fld_3, a.fld_4) = (b.row_id, c.row_id,
d.row_id, e.row_id)
  AND a.deleted = 0
ORDER BY a.tran_date DESC
FETCH FIRST ROW ONLY
) AS a
WHERE a.row_id IS NOT NULL;

You will likely want an index on my_table (fld_1, fld_2, fld_3, fld_4,
tran_date) if your dataset gets bigger.
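
Something along these lines (the index name is arbitrary):

CREATE INDEX my_table_flds_tran_date_idx
ON my_table (fld_1, fld_2, fld_3, fld_4, tran_date);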

This query is 100% Standard SQL, so it *should* work on other engines.
That doesn't mean it will, though.
-- 
Vik Fearing




Re: Slow SELECT

2020-05-26 Thread Frank Millman




On 2020-05-26 12:04 PM, David Rowley wrote:

On Tue, 26 May 2020 at 19:23, Frank Millman  wrote:

The table sizes are -
  my_table : 167 rows
  table_1 : 21 rows
  table_2 : 11 rows
  table_3 : 3 rows
  table_4 : 16 rows

Therefore for each tran_date in my_table there are potentially
21x11x3x16 = 11088 rows. Most will be null.

I want to select the row_id for the last tran_date for each of those
potential groups. This is my select -

  SELECT (
  SELECT a.row_id FROM my_table a
  WHERE a.fld_1 = b.row_id
  AND a.fld_2 = c.row_id
  AND a.fld_3 = d.row_id
  AND a.fld_4 = e.row_id
  AND a.deleted_id = 0
  ORDER BY a.tran_date DESC LIMIT 1
  )
  FROM table_1 b, table_2 c, table_3 d, table_4 e

Out of 11088 rows selected, 103 are not null.


Perhaps SQL Server is doing something to rewrite the subquery in the
target list to a LEFT JOIN.  PostgreSQL currently does not do that.

Since "my_table" is small, you'd likely be much better doing a manual
rewrite of the query to join a subquery containing the required
details from "my_table".  It looks like you want the row_id from the
latest tran_date for each fld_N column. So something like:

SELECT a.row_id
FROM table_1 b
CROSS JOIN table_2 c
CROSS JOIN table_3 d
CROSS JOIN table_4 e
LEFT OUTER JOIN (
SELECT fld_1,fld_2,fld_3,fld_4,row_id,tran_date,
ROW_NUMBER() OVER (PARTITION BY fld_1,fld_2,fld_3,fld_4 ORDER BY
tran_date DESC) row_num
FROM my_table
WHERE deleted_id = 0
) a ON a.fld_1 = b.row_id AND a.fld_2 = c.row_id AND a.fld_3 =
d.row_id AND a.fld_4 = e.row_id AND a.row_num = 1;

Should do it. You could also perhaps do something with DISTINCT ON
instead of using ROW_NUMBER(). That might be a bit more efficient, but
it's unlikely to matter too much since there are only 167 rows in that
table.



Thank you David. I tried that and it produced the correct result in 
53ms, which is what I am looking for.


It will take me some time to understand it fully, so I have some 
homework to do!


Much appreciated.

Frank





Re: Slow SELECT

2020-05-26 Thread Frank Millman




On 2020-05-26 12:02 PM, Christian Ramseyer wrote:

Hi

On 26.05.20 09:22, Frank Millman wrote:


I have looked at the EXPLAIN, but I don't really know what to look for.
I can supply it if that would help.



My favorite approach to tuning Postgres queries is:

1. Run EXPLAIN ANALYZE 
2. Copy/Paste the output into the fantastic https://explain.depesz.com/

This will turn the somewhat hard-to-understand explain output into a
nice colored structure. If it's not obvious from the orange-reddish
boxes where the slowness comes from, please post the link here and
somebody will certainly have some advice.



Thanks, Christian. I will definitely look into that.

Frank





Re: Slow SELECT

2020-05-26 Thread Frank Millman




On 2020-05-26 11:27 AM, Charles Clavadetscher wrote:

On 2020-05-26 11:10, Charles Clavadetscher wrote:

Hello

On 2020-05-26 10:38, Frank Millman wrote:

On 2020-05-26 9:32 AM, Olivier Gautherot wrote:

Hi Frank,

On Tue, May 26, 2020 at 9:23 AM Frank Millman > wrote:


    Hi all

    I have a SELECT that runs over 5 times slower on PostgreSQL 
compared

    with Sql Server and sqlite3. I am trying to understand why.

    I have a table that looks like this (simplified) -

    CREATE TABLE my_table (
      row_id SERIAL PRIMARY KEY,
      deleted_id INT DEFAULT 0,
      fld_1 INT REFERENCES table_1(row_id),
      fld_2 INT REFERENCES table_2(row_id),
      fld_3 INT REFERENCES table_3(row_id),
      fld_4 INT REFERENCES table_4(row_id),
      tran_date DATE,
      tran_total DEC(21,2)
      );

    CREATE UNIQUE INDEX my_table_ndx ON my_table (fld_1, fld_2, fld_3,
    fld_4, tran_date) WHERE deleted_id = 0;

    The table sizes are -
      my_table : 167 rows
      table_1 : 21 rows
      table_2 : 11 rows
      table_3 : 3 rows
      table_4 : 16 rows

    Therefore for each tran_date in my_table there are potentially
    21x11x3x16 = 11088 rows. Most will be null.

    I want to select the row_id for the last tran_date for each of 
those

    potential groups. This is my select -

      SELECT (
          SELECT a.row_id FROM my_table a
          WHERE a.fld_1 = b.row_id
          AND a.fld_2 = c.row_id
          AND a.fld_3 = d.row_id
          AND a.fld_4 = e.row_id
          AND a.deleted_id = 0
          ORDER BY a.tran_date DESC LIMIT 1
      )
      FROM table_1 b, table_2 c, table_3 d, table_4 e

    Out of 11088 rows selected, 103 are not null.

    On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on
    sqlite3, and 0.31 sec on PostgreSQL.


SQL Server does a good job at caching data in memory. PostgreSQL 
does too on consecutive calls to the same table. What execution time 
do you get if you issue the query a second time?


My first guess would be to add an index on my_table.tran_date and 
check in EXPLAIN that you don't have a SEQUENTIAL SCAN on that table.


    I have looked at the EXPLAIN, but I don't really know what to 
look for.

    I can supply it if that would help.

    Thanks for any advice.



Thanks Olivier. Unfortunately that did not help.

I was already running the query twice and only timing the second one.

I added the index on tran_date. The timing is the same, and EXPLAIN
shows that it is using a SEQUENTIAL SCAN.

Here is the EXPLAIN -


 Nested Loop  (cost=0.00..64155.70 rows=11088 width=4)
   ->  Nested Loop  (cost=0.00..10.36 rows=528 width=12)
         ->  Nested Loop  (cost=0.00..2.56 rows=33 width=8)
               ->  Seq Scan on table_2 c  (cost=0.00..1.11 rows=11 width=4)
               ->  Materialize  (cost=0.00..1.04 rows=3 width=4)
                     ->  Seq Scan on table_3 d  (cost=0.00..1.03 rows=3 width=4)
         ->  Materialize  (cost=0.00..1.24 rows=16 width=4)
               ->  Seq Scan on table_4 e  (cost=0.00..1.16 rows=16 width=4)
   ->  Materialize  (cost=0.00..1.31 rows=21 width=4)
         ->  Seq Scan on table_1 b  (cost=0.00..1.21 rows=21 width=4)
   SubPlan 1
     ->  Limit  (cost=5.77..5.77 rows=1 width=8)
           ->  Sort  (cost=5.77..5.77 rows=1 width=8)
                 Sort Key: a.tran_date DESC
                 ->  Seq Scan on my_table a  (cost=0.00..5.76 rows=1 width=8)
                       Filter: ((fld_1 = b.row_id) AND (fld_2 = c.row_id) AND (fld_3 = d.row_id) AND (fld_4 = e.row_id) AND (deleted_id = 0))


Frank


If I see it correct, the query runs sequential scans on all tables,
i.e. table_1 to table_4.
Do you have an index on the referenced keys (row_id) in table_1 to 
table_4?


It happens often that referenced keys are not indexed, leading to poor
execution plans.

Bye
Charles


I noticed later that you have very small tables. This will probably lead 
to a sequential scan although there is an index in place.


I am not sure if it makes a difference, but what about using explicit 
joins?


SELECT a.row_id FROM my_table a
JOIN table_1 b ON (b.row_id = a.fld_1)
JOIN table_2 c ON (c.row_id = a.fld_2)
JOIN table_3 d ON (d.row_id = a.fld_3)
JOIN table_4 e ON (e.row_id = a.fld_4)
WHERE a.deleted_id = 0
ORDER BY a.tran_date DESC LIMIT 1;



Thanks, Charles. I tried that, but unfortunately it produces a different 
result. I need to test for every possible combination of fld1-4, and get 
the highest date for each one. Using joins only tests existing 
combinations, and gets the highest date for all of them combined.


See my reply to David Rowley. I do not fully understand his solution 
yet, but it seems to be what I am looking for.


Thanks again

Frank





Re: Slow SELECT

2020-05-26 Thread David Rowley
On Tue, 26 May 2020 at 19:23, Frank Millman  wrote:
> The table sizes are -
>  my_table : 167 rows
>  table_1 : 21 rows
>  table_2 : 11 rows
>  table_3 : 3 rows
>  table_4 : 16 rows
>
> Therefore for each tran_date in my_table there are potentially
> 21x11x3x16 = 11088 rows. Most will be null.
>
> I want to select the row_id for the last tran_date for each of those
> potential groups. This is my select -
>
>  SELECT (
>  SELECT a.row_id FROM my_table a
>  WHERE a.fld_1 = b.row_id
>  AND a.fld_2 = c.row_id
>  AND a.fld_3 = d.row_id
>  AND a.fld_4 = e.row_id
>  AND a.deleted_id = 0
>  ORDER BY a.tran_date DESC LIMIT 1
>  )
>  FROM table_1 b, table_2 c, table_3 d, table_4 e
>
> Out of 11088 rows selected, 103 are not null.

Perhaps SQL Server is doing something to rewrite the subquery in the
target list to a LEFT JOIN.  PostgreSQL currently does not do that.

Since "my_table" is small, you'd likely be much better doing a manual
rewrite of the query to join a subquery containing the required
details from "my_table".  It looks like you want the row_id from the
latest tran_date for each fld_N column. So something like:

SELECT a.row_id
FROM table_1 b
CROSS JOIN table_2 c
CROSS JOIN table_3 d
CROSS JOIN table_4 e
LEFT OUTER JOIN (
SELECT fld_1,fld_2,fld_3,fld_4,row_id,tran_date,
ROW_NUMBER() OVER (PARTITION BY fld_1,fld_2,fld_3,fld_4 ORDER BY
tran_date DESC) row_num
FROM my_table
WHERE deleted_id = 0
) a ON a.fld_1 = b.row_id AND a.fld_2 = c.row_id AND a.fld_3 =
d.row_id AND a.fld_4 = e.row_id AND a.row_num = 1;

Should do it. You could also perhaps do something with DISTINCT ON
instead of using ROW_NUMBER(). That might be a bit more efficient, but
it's unlikely to matter too much since there are only 167 rows in that
table.
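
For reference, the DISTINCT ON flavour of that subquery might look
roughly like this (an untested sketch against the schema posted earlier):

SELECT a.row_id
FROM table_1 b
CROSS JOIN table_2 c
CROSS JOIN table_3 d
CROSS JOIN table_4 e
LEFT OUTER JOIN (
SELECT DISTINCT ON (fld_1, fld_2, fld_3, fld_4)
fld_1, fld_2, fld_3, fld_4, row_id, tran_date
FROM my_table
WHERE deleted_id = 0
ORDER BY fld_1, fld_2, fld_3, fld_4, tran_date DESC
) a ON a.fld_1 = b.row_id AND a.fld_2 = c.row_id AND a.fld_3 =
d.row_id AND a.fld_4 = e.row_id;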

David




Re: Slow SELECT

2020-05-26 Thread Christian Ramseyer
Hi

On 26.05.20 09:22, Frank Millman wrote:
> 
> I have looked at the EXPLAIN, but I don't really know what to look for.
> I can supply it if that would help.
> 

My favorite approach to tuning Postgres queries is:

1. Run EXPLAIN ANALYZE 
2. Copy/Paste the output into the fantastic https://explain.depesz.com/

This will turn the somewhat hard-to-understand explain output into a
nice colored structure. If it's not obvious from the orange-reddish
boxes where the slowness comes from, please post the link here and
somebody will certainly have some advice.
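
For the query in this thread, that would be something along the lines of
(BUFFERS is optional but usually worth including):

EXPLAIN (ANALYZE, BUFFERS)
SELECT (
SELECT a.row_id FROM my_table a
WHERE a.fld_1 = b.row_id
AND a.fld_2 = c.row_id
AND a.fld_3 = d.row_id
AND a.fld_4 = e.row_id
AND a.deleted_id = 0
ORDER BY a.tran_date DESC LIMIT 1
)
FROM table_1 b, table_2 c, table_3 d, table_4 e;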

Cheers
Christian



-- 
Christian Ramseyer, netnea ag
Network Management. Security. OpenSource.
https://www.netnea.com







Re: Slow SELECT

2020-05-26 Thread Frank Millman

On 2020-05-26 11:10 AM, Charles Clavadetscher wrote:

Hello

On 2020-05-26 10:38, Frank Millman wrote:

On 2020-05-26 9:32 AM, Olivier Gautherot wrote:

Hi Frank,

On Tue, May 26, 2020 at 9:23 AM Frank Millman > wrote:


    Hi all

    I have a SELECT that runs over 5 times slower on PostgreSQL compared
    with Sql Server and sqlite3. I am trying to understand why.

    I have a table that looks like this (simplified) -

    CREATE TABLE my_table (
      row_id SERIAL PRIMARY KEY,
      deleted_id INT DEFAULT 0,
      fld_1 INT REFERENCES table_1(row_id),
      fld_2 INT REFERENCES table_2(row_id),
      fld_3 INT REFERENCES table_3(row_id),
      fld_4 INT REFERENCES table_4(row_id),
      tran_date DATE,
      tran_total DEC(21,2)
      );

    CREATE UNIQUE INDEX my_table_ndx ON my_table (fld_1, fld_2, fld_3,
    fld_4, tran_date) WHERE deleted_id = 0;

    The table sizes are -
      my_table : 167 rows
      table_1 : 21 rows
      table_2 : 11 rows
      table_3 : 3 rows
      table_4 : 16 rows

    Therefore for each tran_date in my_table there are potentially
    21x11x3x16 = 11088 rows. Most will be null.

    I want to select the row_id for the last tran_date for each of those
    potential groups. This is my select -

      SELECT (
          SELECT a.row_id FROM my_table a
          WHERE a.fld_1 = b.row_id
          AND a.fld_2 = c.row_id
          AND a.fld_3 = d.row_id
          AND a.fld_4 = e.row_id
          AND a.deleted_id = 0
          ORDER BY a.tran_date DESC LIMIT 1
      )
      FROM table_1 b, table_2 c, table_3 d, table_4 e

    Out of 11088 rows selected, 103 are not null.

    On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on
    sqlite3, and 0.31 sec on PostgreSQL.


SQL Server does a good job at caching data in memory. PostgreSQL does 
too on consecutive calls to the same table. What execution time do 
you get if you issue the query a second time?


My first guess would be to add an index on my_table.tran_date and 
check in EXPLAIN that you don't have a SEQUENTIAL SCAN on that table.


    I have looked at the EXPLAIN, but I don't really know what to 
look for.

    I can supply it if that would help.

    Thanks for any advice.



Thanks Olivier. Unfortunately that did not help.

I was already running the query twice and only timing the second one.

I added the index on tran_date. The timing is the same, and EXPLAIN
shows that it is using a SEQUENTIAL SCAN.

Here is the EXPLAIN -


 Nested Loop  (cost=0.00..64155.70 rows=11088 width=4)
   ->  Nested Loop  (cost=0.00..10.36 rows=528 width=12)
         ->  Nested Loop  (cost=0.00..2.56 rows=33 width=8)
               ->  Seq Scan on table_2 c  (cost=0.00..1.11 rows=11 width=4)
               ->  Materialize  (cost=0.00..1.04 rows=3 width=4)
                     ->  Seq Scan on table_3 d  (cost=0.00..1.03 rows=3 width=4)
         ->  Materialize  (cost=0.00..1.24 rows=16 width=4)
               ->  Seq Scan on table_4 e  (cost=0.00..1.16 rows=16 width=4)
   ->  Materialize  (cost=0.00..1.31 rows=21 width=4)
         ->  Seq Scan on table_1 b  (cost=0.00..1.21 rows=21 width=4)
   SubPlan 1
     ->  Limit  (cost=5.77..5.77 rows=1 width=8)
           ->  Sort  (cost=5.77..5.77 rows=1 width=8)
                 Sort Key: a.tran_date DESC
                 ->  Seq Scan on my_table a  (cost=0.00..5.76 rows=1 width=8)
                       Filter: ((fld_1 = b.row_id) AND (fld_2 = c.row_id) AND (fld_3 = d.row_id) AND (fld_4 = e.row_id) AND (deleted_id = 0))


Frank


If I see it correct, the query runs sequential scans on all tables, i.e. 
table_1 to table_4.

Do you have an index on the referenced keys (row_id) in table_1 to table_4?

It happens often that referenced keys are not indexed, leading to poor 
execution plans.




The referenced keys are all defined as SERIAL PRIMARY KEY in their own 
tables, so I presume that they are all indexed automatically.


On the other hand, there are not many rows in those tables, so the 
planner may decide not to use the index in that case.


Frank





Re: Slow SELECT

2020-05-26 Thread Charles Clavadetscher

On 2020-05-26 11:10, Charles Clavadetscher wrote:

Hello

On 2020-05-26 10:38, Frank Millman wrote:

On 2020-05-26 9:32 AM, Olivier Gautherot wrote:

Hi Frank,

On Tue, May 26, 2020 at 9:23 AM Frank Millman > wrote:


Hi all

I have a SELECT that runs over 5 times slower on PostgreSQL 
compared

with Sql Server and sqlite3. I am trying to understand why.

I have a table that looks like this (simplified) -

CREATE TABLE my_table (
      row_id SERIAL PRIMARY KEY,
      deleted_id INT DEFAULT 0,
      fld_1 INT REFERENCES table_1(row_id),
      fld_2 INT REFERENCES table_2(row_id),
      fld_3 INT REFERENCES table_3(row_id),
      fld_4 INT REFERENCES table_4(row_id),
      tran_date DATE,
      tran_total DEC(21,2)
      );

CREATE UNIQUE INDEX my_table_ndx ON my_table (fld_1, fld_2, 
fld_3,

fld_4, tran_date) WHERE deleted_id = 0;

The table sizes are -
      my_table : 167 rows
      table_1 : 21 rows
      table_2 : 11 rows
      table_3 : 3 rows
      table_4 : 16 rows

Therefore for each tran_date in my_table there are potentially
21x11x3x16 = 11088 rows. Most will be null.

I want to select the row_id for the last tran_date for each of 
those

potential groups. This is my select -

      SELECT (
          SELECT a.row_id FROM my_table a
          WHERE a.fld_1 = b.row_id
          AND a.fld_2 = c.row_id
          AND a.fld_3 = d.row_id
          AND a.fld_4 = e.row_id
          AND a.deleted_id = 0
          ORDER BY a.tran_date DESC LIMIT 1
      )
      FROM table_1 b, table_2 c, table_3 d, table_4 e

Out of 11088 rows selected, 103 are not null.

On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on
sqlite3, and 0.31 sec on PostgreSQL.


SQL Server does a good job at caching data in memory. PostgreSQL does 
too on consecutive calls to the same table. What execution time do 
you get if you issue the query a second time?


My first guess would be to add an index on my_table.tran_date and 
check in EXPLAIN that you don't have a SEQUENTIAL SCAN on that table.


I have looked at the EXPLAIN, but I don't really know what to 
look for.

I can supply it if that would help.

Thanks for any advice.



Thanks Olivier. Unfortunately that did not help.

I was already running the query twice and only timing the second one.

I added the index on tran_date. The timing is the same, and EXPLAIN
shows that it is using a SEQUENTIAL SCAN.

Here is the EXPLAIN -


 Nested Loop  (cost=0.00..64155.70 rows=11088 width=4)
   ->  Nested Loop  (cost=0.00..10.36 rows=528 width=12)
         ->  Nested Loop  (cost=0.00..2.56 rows=33 width=8)
               ->  Seq Scan on table_2 c  (cost=0.00..1.11 rows=11 width=4)
               ->  Materialize  (cost=0.00..1.04 rows=3 width=4)
                     ->  Seq Scan on table_3 d  (cost=0.00..1.03 rows=3 width=4)
         ->  Materialize  (cost=0.00..1.24 rows=16 width=4)
               ->  Seq Scan on table_4 e  (cost=0.00..1.16 rows=16 width=4)
   ->  Materialize  (cost=0.00..1.31 rows=21 width=4)
         ->  Seq Scan on table_1 b  (cost=0.00..1.21 rows=21 width=4)
   SubPlan 1
     ->  Limit  (cost=5.77..5.77 rows=1 width=8)
           ->  Sort  (cost=5.77..5.77 rows=1 width=8)
                 Sort Key: a.tran_date DESC
                 ->  Seq Scan on my_table a  (cost=0.00..5.76 rows=1 width=8)
                       Filter: ((fld_1 = b.row_id) AND (fld_2 = c.row_id) AND (fld_3 = d.row_id) AND (fld_4 = e.row_id) AND (deleted_id = 0))


Frank


If I see it correct, the query runs sequential scans on all tables,
i.e. table_1 to table_4.
Do you have an index on the referenced keys (row_id) in table_1 to 
table_4?


It happens often that referenced keys are not indexed, leading to poor
execution plans.

Bye
Charles


I noticed later that you have very small tables. This will probably lead 
to a sequential scan although there is an index in place.


I am not sure if it makes a difference, but what about using explicit 
joins?


SELECT a.row_id FROM my_table a
JOIN table_1 b ON (b.row_id = a.fld_1)
JOIN table_2 c ON (c.row_id = a.fld_2)
JOIN table_3 d ON (d.row_id = a.fld_3)
JOIN table_4 e ON (e.row_id = a.fld_4)
WHERE a.deleted_id = 0
ORDER BY a.tran_date DESC LIMIT 1;

Regards
Charles




Re: Slow SELECT

2020-05-26 Thread Charles Clavadetscher

Hello

On 2020-05-26 10:38, Frank Millman wrote:

On 2020-05-26 9:32 AM, Olivier Gautherot wrote:

Hi Frank,

On Tue, May 26, 2020 at 9:23 AM Frank Millman > wrote:


Hi all

I have a SELECT that runs over 5 times slower on PostgreSQL 
compared

with Sql Server and sqlite3. I am trying to understand why.

I have a table that looks like this (simplified) -

CREATE TABLE my_table (
      row_id SERIAL PRIMARY KEY,
      deleted_id INT DEFAULT 0,
      fld_1 INT REFERENCES table_1(row_id),
      fld_2 INT REFERENCES table_2(row_id),
      fld_3 INT REFERENCES table_3(row_id),
      fld_4 INT REFERENCES table_4(row_id),
      tran_date DATE,
      tran_total DEC(21,2)
      );

CREATE UNIQUE INDEX my_table_ndx ON my_table (fld_1, fld_2, fld_3,
fld_4, tran_date) WHERE deleted_id = 0;

The table sizes are -
      my_table : 167 rows
      table_1 : 21 rows
      table_2 : 11 rows
      table_3 : 3 rows
      table_4 : 16 rows

Therefore for each tran_date in my_table there are potentially
21x11x3x16 = 11088 rows. Most will be null.

I want to select the row_id for the last tran_date for each of 
those

potential groups. This is my select -

      SELECT (
          SELECT a.row_id FROM my_table a
          WHERE a.fld_1 = b.row_id
          AND a.fld_2 = c.row_id
          AND a.fld_3 = d.row_id
          AND a.fld_4 = e.row_id
          AND a.deleted_id = 0
          ORDER BY a.tran_date DESC LIMIT 1
      )
      FROM table_1 b, table_2 c, table_3 d, table_4 e

Out of 11088 rows selected, 103 are not null.

On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on
sqlite3, and 0.31 sec on PostgreSQL.


SQL Server does a good job at caching data in memory. PostgreSQL does 
too on consecutive calls to the same table. What execution time do you 
get if you issue the query a second time?


My first guess would be to add an index on my_table.tran_date and 
check in EXPLAIN that you don't have a SEQUENTIAL SCAN on that table.


I have looked at the EXPLAIN, but I don't really know what to look 
for.

I can supply it if that would help.

Thanks for any advice.



Thanks Olivier. Unfortunately that did not help.

I was already running the query twice and only timing the second one.

I added the index on tran_date. The timing is the same, and EXPLAIN
shows that it is using a SEQUENTIAL SCAN.

Here is the EXPLAIN -


 Nested Loop  (cost=0.00..64155.70 rows=11088 width=4)
   ->  Nested Loop  (cost=0.00..10.36 rows=528 width=12)
         ->  Nested Loop  (cost=0.00..2.56 rows=33 width=8)
               ->  Seq Scan on table_2 c  (cost=0.00..1.11 rows=11 width=4)
               ->  Materialize  (cost=0.00..1.04 rows=3 width=4)
                     ->  Seq Scan on table_3 d  (cost=0.00..1.03 rows=3 width=4)
         ->  Materialize  (cost=0.00..1.24 rows=16 width=4)
               ->  Seq Scan on table_4 e  (cost=0.00..1.16 rows=16 width=4)
   ->  Materialize  (cost=0.00..1.31 rows=21 width=4)
         ->  Seq Scan on table_1 b  (cost=0.00..1.21 rows=21 width=4)
   SubPlan 1
     ->  Limit  (cost=5.77..5.77 rows=1 width=8)
           ->  Sort  (cost=5.77..5.77 rows=1 width=8)
                 Sort Key: a.tran_date DESC
                 ->  Seq Scan on my_table a  (cost=0.00..5.76 rows=1 width=8)
                       Filter: ((fld_1 = b.row_id) AND (fld_2 = c.row_id) AND (fld_3 = d.row_id) AND (fld_4 = e.row_id) AND (deleted_id = 0))


Frank


If I see it correct, the query runs sequential scans on all tables, i.e. 
table_1 to table_4.
Do you have an index on the referenced keys (row_id) in table_1 to 
table_4?


It happens often that referenced keys are not indexed, leading to poor 
execution plans.


Bye
Charles




Re: Slow SELECT

2020-05-26 Thread Frank Millman

On 2020-05-26 9:32 AM, Olivier Gautherot wrote:

Hi Frank,

On Tue, May 26, 2020 at 9:23 AM Frank Millman > wrote:


Hi all

I have a SELECT that runs over 5 times slower on PostgreSQL compared
with Sql Server and sqlite3. I am trying to understand why.

I have a table that looks like this (simplified) -

CREATE TABLE my_table (
      row_id SERIAL PRIMARY KEY,
      deleted_id INT DEFAULT 0,
      fld_1 INT REFERENCES table_1(row_id),
      fld_2 INT REFERENCES table_2(row_id),
      fld_3 INT REFERENCES table_3(row_id),
      fld_4 INT REFERENCES table_4(row_id),
      tran_date DATE,
      tran_total DEC(21,2)
      );

CREATE UNIQUE INDEX my_table_ndx ON my_table (fld_1, fld_2, fld_3,
fld_4, tran_date) WHERE deleted_id = 0;

The table sizes are -
      my_table : 167 rows
      table_1 : 21 rows
      table_2 : 11 rows
      table_3 : 3 rows
      table_4 : 16 rows

Therefore for each tran_date in my_table there are potentially
21x11x3x16 = 11088 rows. Most will be null.

I want to select the row_id for the last tran_date for each of those
potential groups. This is my select -

      SELECT (
          SELECT a.row_id FROM my_table a
          WHERE a.fld_1 = b.row_id
          AND a.fld_2 = c.row_id
          AND a.fld_3 = d.row_id
          AND a.fld_4 = e.row_id
          AND a.deleted_id = 0
          ORDER BY a.tran_date DESC LIMIT 1
      )
      FROM table_1 b, table_2 c, table_3 d, table_4 e

Out of 11088 rows selected, 103 are not null.

On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on
sqlite3, and 0.31 sec on PostgreSQL.


SQL Server does a good job at caching data in memory. PostgreSQL does 
too on consecutive calls to the same table. What execution time do you 
get if you issue the query a second time?


My first guess would be to add an index on my_table.tran_date and check 
in EXPLAIN that you don't have a SEQUENTIAL SCAN on that table.


I have looked at the EXPLAIN, but I don't really know what to look for.
I can supply it if that would help.

Thanks for any advice.



Thanks Olivier. Unfortunately that did not help.

I was already running the query twice and only timing the second one.

I added the index on tran_date. The timing is the same, and EXPLAIN 
shows that it is using a SEQUENTIAL SCAN.


Here is the EXPLAIN -


 Nested Loop  (cost=0.00..64155.70 rows=11088 width=4)
   ->  Nested Loop  (cost=0.00..10.36 rows=528 width=12)
         ->  Nested Loop  (cost=0.00..2.56 rows=33 width=8)
               ->  Seq Scan on table_2 c  (cost=0.00..1.11 rows=11 width=4)
               ->  Materialize  (cost=0.00..1.04 rows=3 width=4)
                     ->  Seq Scan on table_3 d  (cost=0.00..1.03 rows=3 width=4)
         ->  Materialize  (cost=0.00..1.24 rows=16 width=4)
               ->  Seq Scan on table_4 e  (cost=0.00..1.16 rows=16 width=4)
   ->  Materialize  (cost=0.00..1.31 rows=21 width=4)
         ->  Seq Scan on table_1 b  (cost=0.00..1.21 rows=21 width=4)
   SubPlan 1
     ->  Limit  (cost=5.77..5.77 rows=1 width=8)
           ->  Sort  (cost=5.77..5.77 rows=1 width=8)
                 Sort Key: a.tran_date DESC
                 ->  Seq Scan on my_table a  (cost=0.00..5.76 rows=1 width=8)
                       Filter: ((fld_1 = b.row_id) AND (fld_2 = c.row_id) AND (fld_3 = d.row_id) AND (fld_4 = e.row_id) AND (deleted_id = 0))



Frank




Re: Slow SELECT

2020-05-26 Thread Olivier Gautherot
Hi Frank,

On Tue, May 26, 2020 at 9:23 AM Frank Millman  wrote:

> Hi all
>
> I have a SELECT that runs over 5 times slower on PostgreSQL compared
> with Sql Server and sqlite3. I am trying to understand why.
>
> I have a table that looks like this (simplified) -
>
> CREATE TABLE my_table (
>  row_id SERIAL PRIMARY KEY,
>  deleted_id INT DEFAULT 0,
>  fld_1 INT REFERENCES table_1(row_id),
>  fld_2 INT REFERENCES table_2(row_id),
>  fld_3 INT REFERENCES table_3(row_id),
>  fld_4 INT REFERENCES table_4(row_id),
>  tran_date DATE,
>  tran_total DEC(21,2)
>  );
>
> CREATE UNIQUE INDEX my_table_ndx ON my_table (fld_1, fld_2, fld_3,
> fld_4, tran_date) WHERE deleted_id = 0;
>
> The table sizes are -
>  my_table : 167 rows
>  table_1 : 21 rows
>  table_2 : 11 rows
>  table_3 : 3 rows
>  table_4 : 16 rows
>
> Therefore for each tran_date in my_table there are potentially
> 21x11x3x16 = 11088 rows. Most will be null.
>
> I want to select the row_id for the last tran_date for each of those
> potential groups. This is my select -
>
>  SELECT (
>  SELECT a.row_id FROM my_table a
>  WHERE a.fld_1 = b.row_id
>  AND a.fld_2 = c.row_id
>  AND a.fld_3 = d.row_id
>  AND a.fld_4 = e.row_id
>  AND a.deleted_id = 0
>  ORDER BY a.tran_date DESC LIMIT 1
>  )
>  FROM table_1 b, table_2 c, table_3 d, table_4 e
>
> Out of 11088 rows selected, 103 are not null.
>
> On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on
> sqlite3, and 0.31 sec on PostgreSQL.
>

SQL Server does a good job at caching data in memory. PostgreSQL does too
on consecutive calls to the same table. What execution time do you get if
you issue the query a second time?

My first guess would be to add an index on my_table.tran_date and check in
EXPLAIN that you don't have a SEQUENTIAL SCAN on that table.
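
That is, something like (the index name is arbitrary):

CREATE INDEX my_table_tran_date_idx ON my_table (tran_date);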


> I have looked at the EXPLAIN, but I don't really know what to look for.
> I can supply it if that would help.
>
> Thanks for any advice.
>
> Frank Millman
>
>
--
Olivier Gautherot


Slow SELECT

2020-05-26 Thread Frank Millman

Hi all

I have a SELECT that runs over 5 times slower on PostgreSQL compared 
with Sql Server and sqlite3. I am trying to understand why.


I have a table that looks like this (simplified) -

CREATE TABLE my_table (
row_id SERIAL PRIMARY KEY,
deleted_id INT DEFAULT 0,
fld_1 INT REFERENCES table_1(row_id),
fld_2 INT REFERENCES table_2(row_id),
fld_3 INT REFERENCES table_3(row_id),
fld_4 INT REFERENCES table_4(row_id),
tran_date DATE,
tran_total DEC(21,2)
);

CREATE UNIQUE INDEX my_table_ndx ON my_table (fld_1, fld_2, fld_3, 
fld_4, tran_date) WHERE deleted_id = 0;


The table sizes are -
my_table : 167 rows
table_1 : 21 rows
table_2 : 11 rows
table_3 : 3 rows
table_4 : 16 rows

Therefore for each tran_date in my_table there are potentially 
21x11x3x16 = 11088 rows. Most will be null.


I want to select the row_id for the last tran_date for each of those 
potential groups. This is my select -


SELECT (
SELECT a.row_id FROM my_table a
WHERE a.fld_1 = b.row_id
AND a.fld_2 = c.row_id
AND a.fld_3 = d.row_id
AND a.fld_4 = e.row_id
AND a.deleted_id = 0
ORDER BY a.tran_date DESC LIMIT 1
)
FROM table_1 b, table_2 c, table_3 d, table_4 e

Out of 11088 rows selected, 103 are not null.

On identical data, this takes 0.06 sec on SQL Server, 0.04 sec on 
sqlite3, and 0.31 sec on PostgreSQL.


I have looked at the EXPLAIN, but I don't really know what to look for. 
I can supply it if that would help.


Thanks for any advice.

Frank Millman