2017-09-14 15:09 GMT+02:00 Pavel Stehule <pavel.steh...@gmail.com>:


  2017-09-14 14:59 GMT+02:00 Frank Millman <fr...@chagford.com>:

    Pavel Stehule wrote:

    2017-09-14 10:14 GMT+02:00 Frank Millman <fr...@chagford.com>:

      Hi all

      This is a follow-up to a recent question I posted regarding a slow query. 
I thought that the slowness was caused by the number of JOINs in the query, but 
with your assistance I have found the true reason. I said in the previous 
thread that the question had become academic, but now that I understand things 
better, it is no longer academic as it casts doubt on my whole approach.

      I have split my AR transaction table into three physical tables – 
ar_tran_inv, ar_tran_crn, ar_tran_rec. I will probably add others at some 
point, such as ar_tran_jnl.

      I then create a VIEW to view all transactions combined. The view is 
created like this -

      CREATE VIEW ar_trans AS
        SELECT ‘ar_inv’ AS tran_type, row_id AS tran_row_id, tran_number ... 
FROM ar_tran_inv WHERE posted = ‘1’
        UNION ALL
        SELECT ‘ar_crn’ AS tran_type, row_id AS tran_row_id, tran_number ... 
FROM ar_tran_crn WHERE posted = ‘1’
        UNION ALL
        SELECT ‘ar_rec’ AS tran_type, row_id AS tran_row_id, tran_number ... 
FROM ar_tran_rec WHERE posted = ‘1’

      I have another table called ‘ar_trans_due’, to keep track of outstanding 
transactions. All of the three transaction types generate entries into this 
table. To identify the source of the transaction, I have created columns in 
ar_trans_due called ‘tran_type’ and ‘tran_row_id’. After inserting a row into 
‘ar_tran_inv’, I invoke this -

        INSERT INTO ar_trans_due (tran_type, tran_row_id, ...) VALUES 
(‘ar_inv’, ar_tran_inv.row_id, ...), and similar for the other transaction 
types. It is handled by a Python program, and it all happens within a 
transaction.

      When I view a row in ar_trans_due, I want to retrieve data from the 
source transaction, so I have this -

        SELECT * FROM ar_trans_due a
        LEFT JOIN ar_trans b ON b.tran_type = a.tran_type AND b.tran_row_id = 
a.tran_row_id

      I understand that PostgreSQL must somehow follow a path from the view 
‘ar_trans’ to the physical table ‘ar_tran_inv’, but I assumed it would execute 
the equivalent of SELECT * FROM ar_tran_inv WHERE row_id = a.tran_row_id AND 
posted = ‘1’.

      If this was the case, it would be an indexed read, and very fast. 
Instead, according to EXPLAIN, it performs a sequential scan of the 
‘ar_tran_inv’ table.

      It also scans ‘ar_tran_crn’ and ‘ar_tran_rec’, but EXPLAIN shows that it 
uses a Bitmap Heap Scan on those. I assume that is because the tables are 
currently empty.

      Is this analysis correct?

    please, send EXPLAIN ANALYZE result :) 


    I tried to reduce this to its simplest form.

    Here is a SQL statement -

        SELECT *
        FROM ccc.ar_trans_due a
        LEFT JOIN ccc.ar_trans b ON
            b.tran_type = a.tran_type AND b.tran_row_id = a.tran_row_id
        WHERE a.row_id = 1

    ar_trans_due is a physical table, ar_trans is a view.

    It takes about 28ms. Here is the explain - https://explain.depesz.com/s/8YY


> The PostgreSQL cannot to push join - in slow case, the UNIONS should be done 
> first - and it requires full scan ar_tran_inv - used filter (posted AND 
> (deleted_id = 0) is not too effective - maybe some composite or partial index 
> helps.
> 
> The fast query doesn't contains unions - so there are bigger space for 
> optimizer - ar_tran_inv is filtered effective - by primary key. 
> 
> So main problem is impossible to push information a.row_id = 1 to deep to 
> query.
> 

Sorry for banging on about this, but someone might be interested in the 
following timings.

The only solution I could find was to ‘denormalise’ (if that is a word) and 
create additional columns on ar_trans_due for cust_row_id and tran_date, to 
avoid using any joins.

Once I had done that, I could run my query two ways – 
  1. using the newly created columns
  2. as before, using a join to the view, which in turn retrieved data from the 
underlying tables.

This was a more complex query than the example above – details available on 
request.

Here are the timings for running the query on identical data sets using 
Postgresql, Sql Server, and Sqlite3 -

PostgreSQL -
    Method 1 - 0.28 sec
    Method 2 – 1607 sec, or 26 minutes

Sql Server -
    Method 1 – 0.33 sec
    Method 2 – 1.8 sec

Sqlite3 -
    Method 1 – 0.15 sec
    Method 2 – 1.0 sec

It seems that Sql Server and Sqlite3 are able to analyse the ‘join’, and 
execute an indexed read against the underlying physical tables.

Frank

Reply via email to