Re: [HACKERS] Check that streaming replica received all data after master shutdown

2015-01-15 Thread Sameer Kumar
That occurred to me as well while reading the first
part of your comments. :)


 Or doing any sanity check of the database on the standby
 utilizing the nature of the data instead?




Best Regards,

*Sameer Kumar | Database Consultant*
*ASHNIK PTE. LTD.*

Re: [HACKERS] Check that streaming replica received all data after master shutdown

2015-01-15 Thread Sameer Kumar
On Wed, Jan 14, 2015 at 2:11 AM, Heikki Linnakangas hlinnakan...@vmware.com
 wrote:

 On 01/13/2015 12:11 PM, Vladimir Borodin wrote:


 On 5 Jan 2015, at 18:15, Vladimir Borodin r...@simply.name wrote:

  Hi all.

 I have a simple script for planned switchover of PostgreSQL (9.3 and
 9.4) master to one of its replicas. This script checks a lot of things
 before doing it and one of them is that all data from master has been
 received by replica that is going to be promoted. Right now the check is
 done like below:

 On the master:

 postgres@pgtest03d ~ $ psql -t -A -c 'select
 pg_current_xlog_location();'
 0/3390
 postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_ctl stop -m fast
 waiting for server to shut down.... done
 server stopped
 postgres@pgtest03d ~ $ /usr/pgsql-9.3/bin/pg_controldata | head
 pg_control version number:            937
 Catalog version number:               201306121
 Database system identifier:           6061800518091528182
 Database cluster state:               shut down
 pg_control last modified:             Mon 05 Jan 2015 06:47:57 PM MSK
 Latest checkpoint location:           0/3428
 Prior checkpoint location:            0/3328
 Latest checkpoint's REDO location:    0/3428
 Latest checkpoint's REDO WAL file:    001B0034
 Latest checkpoint's TimeLineID:       27
 postgres@pgtest03d ~ $

 On the replica (after shutdown of master):

 postgres@pgtest03g ~ $ psql -t -A -c "select
 pg_xlog_location_diff(pg_last_xlog_replay_location(), '0/3428');"
 104
 postgres@pgtest03g ~ $

 These 104 bytes seem to be the size of the shutdown checkpoint record (as I
 can understand from the pg_xlogdump output).

 postgres@pgtest03g ~/9.3/data/pg_xlog $ /usr/pgsql-9.3/bin/pg_xlogdump
 -s 0/3390 -t 27
 rmgr: XLOG  len (rec/tot):   0/    32, tx: 0, lsn:
 0/3390, prev 0/3328, bkp: , desc: xlog switch
 rmgr: XLOG  len (rec/tot):  72/   104, tx: 0, lsn:
 0/3428, prev 0/3390, bkp: , desc: checkpoint: redo 0/3428;
 tli 27; prev tli 27; fpw true; xid 0/6010; oid 54128; multi 1; offset 0;
 oldest xid 1799 in DB 1; oldest multi 1 in DB 1; oldest running xid 0;
 shutdown
 pg_xlogdump: FATAL:  error in WAL record at 0/3428: record with zero
 length at 0/3490

 postgres@pgtest03g ~/9.3/data/pg_xlog $

 I’m not sure that these 104 bytes will always be 104 bytes to have a
 strict equality while checking. Could it change in the future? Or is there
 a better way to understand that streaming replica received all data after
 master shutdown? The check that pg_xlog_location_diff returns 104 bytes
 seems a bit strange.


 Don't rely on it being 104 bytes. It can vary across versions, and across
 different architectures.

 You could simply check that the standby's pg_last_xlog_replay_location() >=
 the master's Latest checkpoint location, and not care about the exact
 difference.




I believe there were some changes made in 9.3 which make a fast or smart
shutdown (of the master) wait for pending WAL to be replicated before
closing the replication connection.

http://git.postgresql.org/pg/commitdiff/985bd7d49726c9f178558491d31a570d47340459
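
For reference, a minimal sketch of that check, run on the standby after the
master has stopped; the '0/3428' LSN stands in for the master's "Latest
checkpoint location" from the pg_controldata output above and must be
substituted with the real value:

    -- Run on the standby once the master has shut down. Returns true when
    -- the standby has replayed at least up to the master's last checkpoint.
    select pg_xlog_location_diff(pg_last_xlog_replay_location(),
                                 '0/3428') >= 0 as caught_up;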


Re: [HACKERS] Using pg_rewind for differential backup

2014-12-03 Thread Sameer Kumar
On Wed, Dec 3, 2014 at 1:32 AM, Robert Haas robertmh...@gmail.com wrote:

 On Fri, Nov 28, 2014 at 2:49 AM, Heikki Linnakangas
 hlinnakan...@vmware.com wrote:
  It also would be quite straightforward to write a separate tool to do
 just
  that. Would be better than conflating pg_rewind with this. You could use
  pg_rewind as the basis for it - it's under the same license as
 PostgreSQL.

Agree. I would want to work on that.


 If we had such a tool in core, would that completely solve the
 differential backup problem, or would more be needed?


In my opinion, yes.

 This discussion is not really adapted on hackers as pg_rewind is not
 included in Postgres core code. Please let's discuss your proposal there.


Got it. But given the above, can we keep this discussion thread here?



Best Regards,

*Sameer Kumar | Database Consultant*
*ASHNIK PTE. LTD.*


[HACKERS] Using pg_rewind for differential backup

2014-11-27 Thread Sameer Kumar
Can we tweak pg_rewind to take differential backups in PostgreSQL?

I was wondering can we hack the pg_rewind code to print the details of the
file which have been modified compared to a target server.
The list output can then be used for taking differential backups.

Or perhaps we can add an option/switch in pg_rewind --action

--action=print --- would print the files which have changed
--action=sync --- would sync them
--action=copy --- with this option I can specify an additional option,
--target-dir, where I can copy the files which have changed blocks


pg_rewind: https://github.com/vmware/pg_rewind


Best Regards,

*Sameer Kumar | Database Consultant*
*ASHNIK PTE. LTD.*


Re: [HACKERS] using rpmbuild with PostgreSQL 9.2.6 source code

2014-01-20 Thread Sameer Kumar
 you need installed devel packages



I tried to install these packages (uuid-devel and systemd-unit) but yum
cannot locate them. I tried to install selinux-policy but found that
selinux-policy 3.7.19 is already installed and is the latest package
available in the Red Hat repository (whereas the build process is looking
for 3.9.13).

Regards
Sameer


Re: [HACKERS] Changes in Trigger Firing

2013-12-05 Thread Sameer Kumar


  I read somewhere that the best editor is the
 one you master (1) :)
 1: http://www.postgresql.org/message-id/m2wrs6giyp@hi-media.com


Thanks, I am using Eclipse now.


Any comments about the utility of this feature? Or is it just me who thinks
this can be useful? I think users/developers of trigger-based replication
tools can benefit from this.

I am not sure how replication tools like Slony handle a scenario where a
replicated table is partitioned and a new partition is added to it.


Regards
Sameer


Re: [HACKERS] Changes in Trigger Firing

2013-12-05 Thread Sameer Kumar
I have finally been able to get the right set of files. I am going with the
approach below:


1) Add a new column in pg_trigger called tgiscascaded

2) Change pg_trigger.h for this

3) Made changes in trigger.c to insert these values

4) Corresponding changes made in
 - reltrigger.h
 - parsenode.h

5) Changed gram.y to understand a new keyword, CASCADED, in the CREATE
TRIGGER statement.
- The option is optional and applies only to non-constraint
triggers
- If the option is specified, the trigger will CASCADE to child
tables

6) I just compiled the source code (modified with the above changes) and it
- Added a new column in pg_trigger
- The new column is able to derive its value from the CREATE
TRIGGER statement based on whether CASCADED was specified or not
- The value is true if the option was specified
- The value is false if the option was not specified


7) Now I will work on the trigger firing mechanism with the approach below:
   - Before firing triggers, check if it's an inherited table
   - First get CASCADED triggers for parents and fire them
   - Proceed as usual


Any suggestions on improving the approach, or comments on the need for this
feature, are welcome.
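
For concreteness, a sketch of what the proposed syntax might look like.
CASCADED is the new, hypothetical keyword (its exact placement in the
statement is my assumption), so this does not run on an unpatched server:

    -- Hypothetical syntax under the proposed patch.
    create trigger parent_audit
        after insert on parent_table
        for each row cascaded   -- new keyword: also fire for child tables
        execute procedure audit_func();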


Regards
Sameer


Re: [HACKERS] Changes in Trigger Firing

2013-12-05 Thread Sameer Kumar
One scenario where I can foresee an issue is when someone uses, within the
trigger function, the name of the table on which the trigger was fired. I am
not sure what issue might crop up, but this will be one of my test cases.





Re: [HACKERS] Changes in Trigger Firing

2013-12-04 Thread Sameer Kumar


 
  CreateTrigStmt is passed to the CreateTrigger function as an argument. I am
  struggling to understand how the values for various members of trigger are
  set and where [which file] calls CreateTrigStmt.
 
 
  Can someone provide some help on this?

 I think you need better tools to guide you in exploring the source code.
 For example, you can use cscope to tell you where is CreateTrigStmt
 used, and you would find gram.y; and use it to tell you where
 CreateTrigger is used, and you would find utility.c.

Thanks for your advice. I was relying on the PostgreSQL documentation, which
has been quite helpful so far. Let me try some development tools.


 Any half-decent code editor should be able to generate a database of
 symbols and let you frolic around the various files quickly.  Without
 that, anyone would be completely lost in developing new features of even
 the lowest complexity.


Got the point!


[HACKERS] Changes in Trigger Firing

2013-12-03 Thread Sameer Kumar
Hi all,

I am making some changes in the trigger firing mechanism (as a POC first; I
will share my work if people find it useful and if it works).

I am going to extend trigger calls to child tables, so that in the trigger
definition I can create triggers with a CASCADED TRUE | FALSE option.

This will enable me to extend certain triggers to child tables and get over
the scenario where a record gets inserted in a child table and any *after
insert* triggers on the parent table never get fired.
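
To make the gap concrete, here is a minimal sketch against stock PostgreSQL
(table and function names are made up for illustration): a row trigger on a
parent table does not fire for rows inserted directly into an inheritance
child.

    -- Stock behaviour the proposal addresses: the parent's row trigger
    -- is skipped when the insert targets the child directly.
    create table parent (id int);
    create table child () inherits (parent);

    create function note_insert() returns trigger as $$
    begin
        raise notice 'parent trigger fired';
        return new;
    end;
    $$ language plpgsql;

    create trigger parent_ai after insert on parent
        for each row execute procedure note_insert();

    insert into parent values (1);  -- raises the NOTICE
    insert into child values (2);   -- no NOTICE: parent trigger is skipped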

I am right now trying to change pg_trigger and the CreateTrigger function. I
have made changes in the files below:

trigger.c:
 - Added a new variable cascadedOption (bool)
 - Added a new statement to set the value:
     values[Anum_pg_trigger_tgiscascaded - 1] =
         BoolGetDatum(stmt->cascadedOption);

pg_trigger.h:
 - Added a new column in the CATALOG definition of pg_trigger:
     bool tgiscascaded;
 - Added a new position for values:
     #define Anum_pg_trigger_tgiscascaded 16

parsenode.h:
 - Added a new element in the structure for the trigger statement,
   CreateTrigStmt:
     bool cascadedOption;

CreateTrigStmt is passed to the CreateTrigger function as an argument. I am
struggling to understand how the values for various members of trigger are
set and where [which file] calls CreateTrigStmt.


Can someone provide some help on this?


Best Regards,
*Sameer Kumar | Database Consultant*

*ASHNIK PTE. LTD.*

Re: [HACKERS] Re: Using indexes for ORDER BY and PARTITION BY clause in windowing functions

2013-11-18 Thread Sameer Kumar
On Thu, Nov 14, 2013 at 6:51 PM, Kyotaro HORIGUCHI 
horiguchi.kyot...@lab.ntt.co.jp wrote:

 Hello,

  When I read it again and try to relate, I get your point. Actually true,
  hashes must always be performed as the last option (if that is what you
  meant too) and if there are a few other operations they must be the last
  ones to be performed, especially after sorting/grouping. Hashes must
  somehow make use of already sorted data (I think this is something even
  you indicated)

 Yes, some 'hash'es could preserve order selecting such a function
 for hash function. But at least PostgreSQL's 'HashAggregation'
 uses not-order-preserving function as hash function. So output
 cannot preserve input ordering.

  I will do that if I get a DB2 system or Oracle system running. I will try
  to replicate the same 2 test cases and share the plan. One thing which I
 am
  sure is, the below part of the plan
 
  QUERY PLAN | Subquery Scan on __unnamed_subquery_0
   (cost=12971.39..16964.99 rows=614 width=43) (actual
  time=2606.075..2953.937 rows=558 loops=1)
 
  would be generated as RID scan in DB2 (which I have seen to perform
 better
  than normal subquery scans in DB2).

 DB2's document says it is used for 'index ORing' corresponds OR
 or IN ops, which seems to be a relative to BitmapOr of
 PostgreSQL, perhaps not to HashAggregates/SemiJoin.

 I tried to imagin the plan for the group_by case with repeated
 index scan and merging..

  select student_name
  from student_score
  where (course,score) in (select course,max(score)
   from student_score group by course);

 Taking the advantage that the cardinarity of course is 8, this
 query could be transformed into 8 times of index scan and
 bitmaping.

 With hypothetical plan node LOOP, and BitmapScanAdd the plan
 could be,

 | BitmapHeapScan (rows = 155, loops = 1)
 |  -> LOOP
 |     ON Subquery (select distinct course from student_course) as c0
 |     -> BitmapScanAdd (loops = 8)
 |        BitmapCond: (student_score.score = x)
 |        -> Limit (rows = 1, loops = 8) AS x
 |           -> Unique (rows = 1, loops = 8)
 |              -> IndexScan using idx_score on student_course (rows = 1, loops = 8)
 |                 Filter (student_course.course = c0)

 I suppose this is one possibility of what DB2 is doing. If DB2
 does the same optimization for ranking > 1 with the dense_rank()
 case, this plan might be like this,

I cannot be sure, but this one seems logically correct from a cost and
cardinality perspective (I am not sure of the operations that the DB2
planner would generate). I need to test it.



 | BitmapHeapScan (rows = 133, loops = 1)
 |  -> LOOP
 |     ON Subquery (select distinct course from student_course) as c0
 |     -> BitmapScanAdd (loops = 8)
 |        BitmapCond: (student_score.score = x)
 |        -> Limit (rows = 1, loops = 8) AS x
 |           -> Unique (rows = 2, loops = 8)
 |              -> IndexScan using idx_score on student_course (rows = 18, loops = 8)
 |                 Filter (student_course.course = c0)

 Both plans surely seem to be done shortly for relatively small
 n's and number of courses.

 I guess they would do well even when the cardinality of courses is fairly
high (unless we hit a scenario where courses offered are more than/in the
same decimal range as students opting for them).

On the other hand, using semijoin as PostgreSQL does, creating
 HashAggregate storing nth place score for every course requires
 some memory to work on for each course.

 | Hash Semi Join
 |   Hash Cond: (a.course = b.course and a.score = b.score)
 |   -> SeqScan on student_score as a
 |   -> Hash
 |      -> HashAggregatesFunc (rows = 8)
 |         Key = course, func = rankn(dense_rank(), n, key, val)
 |         -> SeqScan on student_score (rows = 122880)

 Where, rankn() must keep scores down to nth rank and emits nth
 score as final value. I don't get more generic form for this
 mechanism right now, and the value to do in this specific manner
 seems not so much..


 I feel the advantage could be greater when dealing with a DW environment
which has more complex aggregate and windowing queries. Extending this to
other windowing functions could be a great gain for DW and OLAP queries.


Regards
Sameer


Re: [HACKERS] [PATCH] Use MAP_HUGETLB where supported (v3)

2013-11-16 Thread Sameer Kumar
I was recently running some tests with huge pages. I ran them on two
different architectures: x86_64 and PPC64.

I saw some discussion going on over here so thought of sharing.
I was using 3 cores, 8GB RAM, and 2 LUNs for filesystems (1 for dbfiles and
1 for logfiles) for these tests...

I had dedicated (shared_buffers + 400 bytes * max_connections + wal_buffers)
/ Hugepagesize [from /proc/meminfo] pages to huge pages. I kept some
overcommit_hugepages to be used by work_mem:
(max_connections * work_mem) / Hugepagesize.
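
As a worked instance of that arithmetic (all figures are assumptions for
illustration: 2GB shared_buffers, 100 max_connections, 16MB wal_buffers, and
the 2MB x86_64 huge page size):

    -- Illustrative sizing only; substitute your own settings.
    -- (2GB + 400 bytes * 100 connections + 16MB) / 2MB, rounded up.
    select ceil((2147483648 + 400*100 + 16777216) / 2097152.0) as nr_hugepages;
    -- => 1033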


x86_64 gave me a benefit of 2-5% for a TPC-C workload (I scaled from 1 to
100 users). PPC64, which uses 16MB and 64MB pages, did not give me any
benefit; in fact, performance degraded as the concurrency of the system
increased.

my 2 cents, hope it helps.


Re: [HACKERS] Re: Using indexes for ORDER BY and PARTITION BY clause in windowing functions

2013-11-05 Thread Sameer Kumar
 Hello,

   With this index, you will get a different plan like this,
  
  Exactly my point, can we look at making windowing functions
  smart and make use of available indexes?

 I might have guessed..


   Does this satisfies your needs?
  
  Not exactly. If I have missed to mention, this is not a
  production issue for me. I am trying to see if PostgreSQL
  planner produces best plans for Data Warehouse and mining
  oriented queries.

 I agree to the point.

  I think Hashes can be efficiently used for sorting (and I
  believe they are used for joins too when a pre-sorted data set
  is not available via indexes). This again could be my
  misinterpretation.

 It is true if 'Sorting' means 'key classification without
 orderings'. Hashes should always appear at inner side of a join,
 I'm convinced. The 'ordered' nature is not required for the case
 if outer side is already ordered. If not, separate sorting will
 be needed.

  I lost you somewhere here. Maybe this is above my pay-grade :-)

 Sorry for my clumsy English :-


No, it was not your English. :-)
When I read it again and try to relate, I get your point. Actually true,
hashes must always be performed as the last option (if that is what you
meant too) and if there are a few other operations they must be the last
ones to be performed, especially after sorting/grouping. Hashes must somehow
make use of already sorted data (I think this is something even you
indicated).


 Well, at least with Oracle and DB2 planners I have seen that
  the plan produced with dense_rank performs better than a series
  of nested SELECT MAX().

 I see your point. Although I don't know what plans they
 generate, and I don't see how to do ordering and ranking without
 sorting.  Could you let me see what they look like?

 # Nevertheless, I don't have the confidence that I can be of some
 # help..

 I will do that if I get a DB2 system or Oracle system running. I will try
to replicate the same 2 test cases and share the plan. One thing which I am
sure of is that the below part of the plan

QUERY PLAN | Subquery Scan on __unnamed_subquery_0
 (cost=12971.39..16964.99 rows=614 width=43) (actual
time=2606.075..2953.937 rows=558 loops=1)

would be generated as RID scan in DB2 (which I have seen to perform better
than normal subquery scans in DB2).



Regards
Sameer
Ashnik Pte Ltd


Re: [HACKERS] Re: Using indexes for ORDER BY and PARTITION BY clause in windowing functions

2013-10-28 Thread Sameer Kumar



  Agree that windowing function will return all the rows compared to max and
  group by returning only max rows per group. But even while arriving at the
  aggregate/sorting windowing function seems to spend more effort than group
  by/order by.

 (I'll apologise in advance for possible misreading..)

 The most cause of the difference in time comes from sorting. Over
 90% of total execution time has elapsed while sorting
 (49ms-2733ms) for the one using windowing function. If this sort
 were useless, the execution time would be less than 300 ms -
 seems comparable enough to group-by query.

 I will agree with you on this point.



 | Subquery Scan on __unnamed_subquery_0
 |   (actual time=2606.075..2953.937 rows=558 loops=1)
 |   Filter: (__unnamed_subquery_0.rn = 1)
 |   ->  WindowAgg  (actual time=2606.063..2928.061 rows=122880 loops=1)
 |         ->  Sort (actual time=2606.020..2733.677 rows=122880 loops=1)
 |               Sort Key: student_score.course, student_score.score
 |               ->  Seq Scan on student_score
 |                     (actual time=0.009..49.026 rows=122880 loops=1)

 As you see in above plan, sorting key is (course, score). If your
 point is the overall performance but not reusing a kind of
 'hash', there's a big chance to eliminate this sorting if you are
 able to have an additional index, say,

 =# create index idx_co_sc on student_score using btree (course, score);

 With this index, you will get a different plan like this,

 Exactly my point, can we look at making windowing functions smart and make
use of available indexes?



  uniontest=# explain analyze select student_name from (select
 student_name, dense_rank() over(partition by course order by score) rn,
 score from student_score) rnn where rn=2;
QUERY PLAN
 
 ---
   Subquery Scan on rnn  (actual time=0.088..319.403 rows=135 loops=1)
     Filter: (rnn.rn = 2)
     Rows Removed by Filter: 122746
     ->  WindowAgg  (actual time=0.037..296.851 rows=122881 loops=1)
           ->  Index Scan using idx_co_sc on student_score
                 (actual time=0.027..111.333 rows=122881 loops=1)
   Total runtime: 319.483 ms

 Does this satisfy your needs?

 Not exactly. In case I have missed mentioning it, this is not a production
issue for me. I am trying to see if the PostgreSQL planner produces the best
plans for Data Warehouse and mining oriented queries.

 ===
  Another thing, (I may be stupid and naive here) does PostgreSQL
  re-use the hash which has been already created for sort? In
  this case the inner query must have created a hash for windowing
  aggregate. Can't we use that same one while applying the
  filter rn=1 ?

 Generally speaking, hashes cannot yield ordered output by their
 nature, I believe.

 I think Hashes can be efficiently used for sorting (and I believe they are
used for joins too when a pre-sorted data set is not available via
indexes). This again could be my misinterpretation.

Windowing function (execnode) always receives tuples sequentially
 in the window-defined order (as you see in the explained plan
 above) then processes the tuples in semi tuple-by-tuple manner to
 perform per-frame aggregation, and finally outputs tuples of the
 same number to input. And furthermore, dense_rank() doesn't even
 need per-frame aggregations. So none of the planners so far seems
 to have a chance to use a kind of hash table to calculate/execute
 windowing functions. On another point, automatically
 preserving some internal data within a query beyond the end of
 the query brings in a 'when to discard it?' problem.


 I lost you somewhere here. Maybe this is above my pay-grade :-)
Well, at least with the Oracle and DB2 planners I have seen that the plan
produced with dense_rank performs better than a series of nested SELECT
MAX().

Regards
Sameer


[HACKERS] Using indexes for ORDER BY and PARTITION BY clause in windowing functions

2013-10-23 Thread Sameer Kumar
Hi,

I was dealing with windowing functions recently. I feel they are pretty
useful and quite handy in a lot of operations.

I am not sure why, but my PostgreSQL does not seem to be using indexes for
the ORDER BY or PARTITION BY clause which I use with a windowing function.
I have tried ORDER BY and GROUP BY clauses in a normal SQL statement and
they seem to use indexes nicely.

Is this being already considered for development?

Best Regards,
*Sameer Kumar | Database Consultant*
*ASHNIK PTE. LTD.*


Re: [HACKERS] Using indexes for ORDER BY and PARTITION BY clause in windowing functions

2013-10-23 Thread Sameer Kumar
On Wed, Oct 23, 2013 at 10:58 PM, Tom Lane t...@sss.pgh.pa.us wrote:

 Sameer Kumar sameer.ku...@ashnik.com writes:
  I am not sure why but my PostgreSQL does not seem to be using indexes for
  ORDER BY clause or PARTITION BY CLAUSE which I use with windowing
 function.

 When the entire contents of the table have to be read, a seqscan-and-sort
 will frequently be estimated as cheaper than an indexscan.  If you think
 this is not true on your hardware, you might need to adjust
 random_page_cost.

 regards, tom lane
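
As a minimal illustration of the knob Tom mentions (the value 1.5 is an
arbitrary example, not a recommendation):

    -- Compare plans under a cheaper random-I/O assumption, session only.
    set random_page_cost = 1.5;   -- default is 4.0
    explain analyze select max(score) from student_score group by course;
    reset random_page_cost;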

My mistake. I had understood the issue wrongly.

Actually, when I use functions like max() to find the maximum value grouped
by another column, I get better performance than when I try to do the same
operation using max() over().

Take a look at the plans below:

edb=# \x
Expanded display is on.
edb=# \d student_score
        Table "enterprisedb.student_score"
    Column    |          Type           | Modifiers
--------------+-------------------------+-----------
 id           | integer                 | not null
 student_name | character varying(1000) |
 score        | integer                 |
 course       | character varying(100)  |
Indexes:
    "student_score_pkey" PRIMARY KEY, btree (id)
    "idx_course" btree (course)
    "idx_score" btree (score)

edb=# select count(*) from student_score ;
-[ RECORD 1 ]-
count | 122880

edb=# explain analyze select max(score) from student_score group by course;
-[ RECORD 1
]-
QUERY PLAN | HashAggregate  (cost=3198.20..3198.26 rows=6 width=9) (actual
time=110.792..110.793 rows=6 loops=1)
-[ RECORD 2
]-
QUERY PLAN |   ->  Seq Scan on student_score  (cost=0.00..2583.80
rows=122880 width=9) (actual time=0.011..23.055 rows=122880 loops=1)
-[ RECORD 3
]-
QUERY PLAN | Total runtime: 110.862 ms

edb=# explain analyze select max(score) over(partition by course) from
student_score ;
-[ RECORD 1
]
QUERY PLAN | WindowAgg  (cost=0.00..10324.65 rows=122880 width=9) (actual
time=36.145..224.504 rows=122880 loops=1)
-[ RECORD 2
]
QUERY PLAN |   ->  Index Scan using idx_course on student_score
  (cost=0.00..8481.45 rows=122880 width=9) (actual time=0.037..85.283
rows=122880 loops=1)
-[ RECORD 3
]
QUERY PLAN | Total runtime: 242.949 ms

As you can see, there is a difference of about 2x. Along similar lines, when
I have to find the students who topped (had the highest score) per course, I
will fire something like below:



edb=# explain analyze select student_name from student_score where
(course,score)in (select course,max(score) from student_score group by
course);
-[ RECORD 1
]---
QUERY PLAN | Hash Semi Join  (cost=3198.41..6516.76 rows=7300 width=43)
(actual time=113.727..181.045 rows=555 loops=1)
-[ RECORD 2
]---
QUERY PLAN |   Hash Cond: (((enterprisedb.student_score.course)::text =
(enterprisedb.student_score.course)::text) AND
(enterprisedb.student_score.score =
(max(enterprisedb.student_score.score
-[ RECORD 3
]---
QUERY PLAN |   ->  Seq Scan on student_score  (cost=0.00..2583.80
rows=122880 width=52) (actual time=0.009..22.702 rows=122880 loops=1)
-[ RECORD 4
]---
QUERY PLAN |   ->  Hash  (cost=3198.32..3198.32 rows=6 width=9) (actual
time=111.521..111.521 rows=6 loops=1)
-[ RECORD 5
]---
QUERY PLAN | Buckets: 1024  Batches: 1  Memory Usage: 1kB
-[ RECORD 6

Re: [HACKERS] Re: Using indexes for ORDER BY and PARTITION BY clause in windowing functions

2013-10-23 Thread Sameer Kumar
Agree that the windowing function will return all the rows compared to max
and group by returning only the max rows per group. But even while arriving
at the aggregate/sort, the windowing function seems to spend more effort
than group by/order by.

I am just trying to see if we could somehow optimize the way windowing
operations are performed (maybe in query rewrite). Data warehouses could
use that improvement.
It's not my production box, so I can live with the disk sort. I have tried
with huge sort memory but I still see a similar difference in the cost of
sorting for grouping/ordering vs windowing functions.
Another thing regarding work_mem: I have generally seen that windowing
functions expect more memory for sorting compared to
grouping/ordering clauses.
On 24 Oct 2013 10:54, David Johnston pol...@yahoo.com wrote:

 Sameer Kumar wrote
  edb=# explain analyze select max(score) from student_score group by
  course;

 This query returns 6 records.  The window one returns 123,000.  Why do you
 expect these to have anywhere near the same performance or plan?

 You can enable/disable indexes/scans to see what alternative plans may
 provide but nothing here stands out as being obviously incorrect.

 I'm not really clear on what your question is.  Generally it sounds as if
 you are wondering if there are any plans to improve the algorithms behind
 window function processing.  Are you just looking at symptoms and thus
 possibly have unreasonable expectations or do you actually see an avenue
 for improvement in the engine?


  QUERY PLAN |   Sort Method: external merge  Disk: 7576kB

 Work memory; I/O is killing your performance on this query.  It is more
 flexible but you pay a price for that.


  Another thing, (I may be stupid and naive here) does PostgreSQL re-use the
  hash which has been already created for sort? In this case the inner query
  must have created a hash for windowing aggregate. Can't we use that same
  one while applying the filter rn=1 ?

 Probably but others more knowledgeable will need to answer authoritatively.

 David J.
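
As a minimal sketch of the plan exploration David suggests (enable_seqscan
is a standard planner GUC; the query is the one from upthread):

    -- Steer the planner away from a sequential scan for this session only,
    -- purely to compare the alternative plan and its estimated cost.
    set enable_seqscan = off;
    explain analyze select max(score) from student_score group by course;
    reset enable_seqscan;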




