Re: [PERFORM] autocommit (true/false) for more than 1 million records

2014-08-26 Thread David Johnston
On Wed, Aug 27, 2014 at 1:02 AM, Alex Goncharov <
alex.goncharov@gmail.com> wrote:

> > Thank you, Kevin -- this is helpful.
>
> Thank you David, too.
>
>
> > But it still leaves questions for me.
>
> Still...
>
>
> Alex Goncharov  wrote:
>
> >>> How do I decide, before starting a COPY data load, whether such a load
> >>> protection ("complexity") makes sense ("is necessary")?
>
> This is *the* practical question.
>
>
> David G Johnston  wrote:
>
> > You should probably consider something like: http://pgloader.io/
>
> This is not my question; I want to see if anybody can offer a
> meaningful evaluation strategy for the situation of potentially using or not
> using COPY for loading the "big data".
>

OK.  Though I presume that, given the limitations of COPY - of which the whole
"all-or-nothing" behavior is one - pointing out more user-friendly APIs is
worthwhile.


> If nobody can, fine: it'll give me the reason to claim "Nobody knows".
>
> > Normal case, with normal COPY,
>
> This is the case I am asking about: the COPY operation limitations for
> the "big data": until what point a plain COPY can be used.
>
> > you load a bad file into an empty table, it fails, you truncate and
> > get better data for the next attempt.
>
> This is not how many businesses operate.
>
>
Yet this is basically what you are asking about.



> > How long that will take is system (IOPS/CPU) and data dependent.
>
> "How long", was not the question: my question was originally about the
> behavior for a bad record at the end of a large data set submitted to
> COPY; when it was stated that the data "in process" becomes an
> invisible (until committed) part of the target table, it became
> obvious to me that the fundamental question has to be asked: "How much
> can fit there, in the temporary operational space (whatever it's
> called in PostgreSQL.)?"  "df /mount -> free" or "2^32"?
>
> > The probability of failure is source dependent - and prior
> > experience plays a large role here as well.
>
> Not the question.
>
> > If you plan to load directly into a live table the wasted space from
> > a bad load could kill you so smaller partial loads are better - if
> > you can afford the implicit system inconsistency such a partial load
> > would cause.
>
> Not the question.
>

These were things to consider when deciding whether it is worthwhile to
split the large file into chunks.


> > If you understand how the system works
>
> I don't, to the necessary extent, so I asked for an expert opinion :)
>
> > you should be able to evaluate the different pieces and come to a
> > conclusion as how best to proceed in a specific situation.  No one
> > else on this list has the relevant information to make that
> > judgement call.
>
> We'll see; too early to tell yet :)
>
> > If this is just asking about rules-of-thumb
>
> Yes.
>
> > I'd say figure out how many records 100MB consumes and COMMIT after that
> > many records.
>
> Pardon me: I am running COPY and know how many records are processed
> so far?.. (Hmm... can't be.)
>

Take your 1TB file, extract the first 100MB, and count the number of record
separators.  Commit after that many records.

>
> > 10,000 records is also a nice round number to pick - regardless of
> > the amount of MB consumed.  Start there and tweak based upon
> > experience.
>
> You are clearly suggesting to split the large data file into many
> small ones.  To split very intelligently, on the record boundaries.
>
> And since this is very hard and would involve quite another, external
> processing machinery, I am trying to understand until what point this
> is safe not to do (subject to what factors.)
>
>
See the thoughts to consider in my previous e-mail.


> > If you are not taking advantage of the "unlogged load" optimization,
>
> I don't see any way to control this for COPY only.  Are you talking
> about the 'postgresql.conf' settings?
>

I am not sure if this is the same thing, but I am pretty sure he is
referring to creating an unlogged table as the COPY target - thus avoiding
WAL.
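
A minimal sketch of that approach, assuming a hypothetical target table named
big_table and a server-side file path - the unlogged staging table keeps the
COPY itself out of the WAL:

  CREATE UNLOGGED TABLE big_table_stage (LIKE big_table INCLUDING DEFAULTS);
  COPY big_table_stage FROM '/path/to/data.csv' (FORMAT csv);  -- this COPY skips WAL
  -- validate/clean here; the INSERT below is WAL-logged as usual
  INSERT INTO big_table SELECT * FROM big_table_stage;
  DROP TABLE big_table_stage;

Keep in mind that an unlogged table's contents are truncated after a crash, so
this only helps when you can simply re-run the load.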


> > If you only have 500k free in your archive directory that 1MB file
> > will pose a problem...though if you have 4TB of archive available
> > the 1TB would fit easily.
>
> So the answer to the "How much data can fit in the COPY storage
> areas?" question is solely a "df /mount/point" thing?
>
> I.e. before initiating the COPY, I should:
>
>ls -l DATA-FILE
>df -m /server/db-cluster/pg_data-or-something
>
> compare the two values and be assured that my COPY will reach the end
> of my DATA-FILE (whether it stumbles at the end or not) if the former
> value is meaningfully smaller than the latter?
>
> I would take this for the answer. (Let's see if there are other
> evaluation suggestions.)
>

That should get the COPY to succeed, though whether you blow up your
archives or slaves is not addressed.
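
If you want a quick read on whether WAL archiving or streaming replicas are
even in play before a big load, a hedged sketch (these are standard settings;
how much they matter is installation specific):

  SHOW wal_level;        -- 'minimal' means no archiving/replication to feed
  SHOW archive_mode;     -- 'on' means every WAL segment also lands in the archive
  SHOW max_wal_senders;  -- greater than 0 suggests streaming replicas may be attached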


> > Do you compress your WAL files before shipping them off to the
> > archive?  How compressible is your data?
>
> Try to give me the up

Re: [PERFORM] autocommit (true/false) for more than 1 million records

2014-08-25 Thread David Johnston
On Mon, Aug 25, 2014 at 9:40 AM, Emi Lu  wrote:

>
> By the way, could someone let me know why set autocommit(false) is for
> sure faster than true please? Or, some online docs talk about this.
>
>
Not sure about the docs specifically, but:

Commit is expensive because, as soon as it is issued, all of the data has to
be guaranteed written to disk.  While ultimately the same amount of data is
guaranteed either way, doing commits in batches gives you the opportunity to
achieve economies of scale.

(I think...)
When you commit you flush data to disk - until then you can make use of
RAM.  Once you exhaust RAM you might as well commit and free up that RAM
for the next batch.
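
A rough sketch of what batching looks like in practice, with a hypothetical
table t - one disk flush covers the whole group instead of one per row:

  -- autocommit on: every statement is its own commit (one flush each)
  INSERT INTO t VALUES (1);
  INSERT INTO t VALUES (2);

  -- batched: a single commit (one flush) covers all of the rows
  BEGIN;
  INSERT INTO t VALUES (1);
  INSERT INTO t VALUES (2);
  -- ... a few thousand more rows ...
  COMMIT;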

David J.


Re: [PERFORM] parse/bind/execute

2014-06-04 Thread David Johnston
Please do not top-post on these lists.

On Wednesday, June 4, 2014, Huang, Suya  wrote:

> Thank you David, I copied the detailed activity from the report as below.
> As it shows, it has prepare and bind queries.  One of the items has a
> Bind/Prepare ratio as high as 439.50, so does that look like a good value?
>
> Another question is if bind only happens in a prepared statement?
>
> Day     Hour  Prepare  Bind   Bind/Prepare  Percentage of prepare
> Jun 03  00    205      209    1.02          1.27%
>         01    19       19     1.00          0.17%
>         02    0        0      0.00          0.00%
>         03    0        0      0.00          0.00%
>         04    6        6      1.00          0.00%
>         05    2        879    439.50        0.02%
>         06    839      1,323  1.58          7.01%
>         07    0        0      0.00          0.00%
>
>
>
Yes. Something that high usually involves batch inserting into a table.  To
be honest, a global picture is of limited value for this very reason:
representing all of your usage as a single number is problematic.  Breaking
it down by hour, as done here, increases the likelihood of seeing something
useful, but typically that would be by chance.  In this case, because batch
processing is done in the early morning when few users are on the system (a
common source of one-off statements), the numbers here are dominated by the
special case of bulk inserts and are not typical of normal activity and
performance.
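
For reference, a high Bind/Prepare ratio simply means one parse was reused for
many executions; a sketch with a hypothetical events table:

  PREPARE ins (int, text) AS
      INSERT INTO events (id, payload) VALUES ($1, $2);  -- counted once as Prepare
  EXECUTE ins (1, 'a');   -- each EXECUTE shows up as a Bind (plus an Execute)
  EXECUTE ins (2, 'b');
  EXECUTE ins (3, 'c');   -- ...hundreds of binds per prepare during a batch load
  DEALLOCATE ins;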

David J.


Re: [PERFORM] pg_dump vs pg_basebackup

2014-03-25 Thread David Johnston
gianfranco caca wrote
> Hai,
> 
> Can anyone tell me the difference and performance between pgdump and
> pg_basebackup if I want to backup a large database.
> 
> Thanks

Yes.  And many of their words have been written down in the documentation in
a chapter named "Backup and Restore".  Do you have a specific question about
what is written there?

I'll add that comparing the performance of the two is relatively meaningless.
You need to understand how each works and then choose the correct tool for your
situation.

Lastly, you should actually do both, on a development database, and measure
the time and effort while practicing both routines (backup and restoring)
yourself.

David J.






Re: [PERFORM] Connection pooling - Number of connections

2014-03-21 Thread David Johnston
Sethu Prasad wrote
> Reaching the maxPoolSize from the minPoolSize means creating the
> connections at the crucial moment when the client application is in
> desperate need of completing an important query/transaction - its primary
> responsibility - since it cannot hold the data collected.

One query is slowed down a little in the unusual situation where not enough
pooled connections are available.  To fix that you want to slow down the
entire server all of the time?  Really?  And even if this is sometimes the
best option, your assertion is unqualified, so do you really think this is
best for everyone, always?

I think it is good to give developers options but if your situation is 10 /
100 then a fixed 100 connection pool is probably not the best configuration.

The question I'd ask is: if you are developing a new driver, what problem and
use-case are you trying to accommodate?  For those in the general case a
resizing pool is probably the best bet.  It will usually stabilize at
typical volume so that an optimum number of connections is maintained while
still allowing for some expansion in times of excess demand.  A fixed-size
pool would be something an experienced user decides they need based
upon their usage patterns and the need to eke out every last bit of performance
in extreme situations, while only trading a little bit of performance
when the connections are not maxed out.

David J.






Re: [PERFORM] Connection pooling - Number of connections

2014-03-21 Thread David Johnston
Guillaume Smet wrote
> Brett Wooldridge, the creator of HikariCP [1] - a high performance
> Java connection pool - is contemplating the idea to change the way
> pooling is done in HikariCP and have a fixed-size pool of connections
> always open.
> 
> No maxPoolSize, no minIdle, no minPoolSize, just a poolSize parameter
> which sets the size of the pool. At application startup, all the
> connections are opened and maintained by the pool throughout the life
> of the application.
> 
> The basic idea is that if you decide that your application might need
> 100 connections at time, you set poolSize to 100 and HikariCP
> maintains 100 connections open.
> 
> I recall very old posts on this list where people were talking about
> code paths sensitive to the number of connections open (or even
> max_connections) and that it wasn't such a good idea to keep
> connections open if they were not really needed.
> 
> As a lot of scalability work has been done since this (very old) time,
> I was wondering if it was still the rule of thumb or if the idea of
> Brett to completely simplify the connection management is the way to
> go.
> 
> It seems that at least another pool implementation is going this way
> so I thought it might be a good idea to have the opinion of the
> database side of things. This way, it will be easier to take a well
> informed decision.

The developer, not the pool implementer, is ultimately going to decide which
trade-offs to incur.  Having a connection open, even if idle, consumes
resources and costs some performance, no matter how minimal.

Pool management costs cycles as well, so if one does not need pool
management then getting rid of it is probably worthwhile.  The
question is whether you want to meet only the needs of this specific user or
whether you want to provide them with flexibility.

If existing pool management implementations are reasonably well implemented
and efficient then focusing effort on a potentially under-served use-case
definitely has merit.

Consider this train of thought: no matter how large the pool size, if you
are constantly keeping, say, 90% of the connections actively working, then
having, on average, 10% of the connections sitting idle is probably not
going to be noticeable on the server, and the reduction in overhead of
managing a pool is typically a net positive.  Now, I have no clue what
percentage actually holds, or under what conditions and pool sizes it may
vary, but that is a calculation that someone deciding between managed and
un-managed pools would need to make.
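
One way to get a feel for that percentage on your own server is to sample
connection states over time; a simple sketch (pg_stat_activity is standard,
and the state column is available in 9.2 and later):

  SELECT state, count(*)
  FROM pg_stat_activity
  GROUP BY state;   -- compare 'active' vs 'idle' counts at typical and peak load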

David J.






Re: [PERFORM] Slow query

2014-03-06 Thread David Johnston
Bikram Kesari Naik wrote
> Hi David,
> 
> We have indexes on all the columns which are used in the where clause and
> these tables are linked by foreign key constraint.
> 
> 
> Thanks,
> Bikram
> 
> -----Original Message-----
> From: pgsql-performance-owner@ [mailto:pgsql-performance-owner@] On Behalf Of David Johnston
> Sent: Friday, March 07, 2014 11:53 AM
> To: pgsql-performance@
> Subject: Re: [PERFORM] Slow query
> 
> Bikram Kesari Naik wrote
>> Hi,
>>
>> I have a view which joins multiple tables to give me a result. It
>> takes more than a minute to give me the result at the psql prompt when I
>> select all the data from that view.
>> The single CPU which is used to run this query is utilized 100%. Even
>> if I fire a count(*) it takes 10 sec. I wanted to know if there is
>> anything we can do to speed up this query to below 1 sec.
> 
> In all likelihood you need to index your foreign keys, and possibly other
> fields, but as you haven't provided table and index definitions it is hard
> to say for sure.
> 
> Depending on how many rows are hidden I'm not sure an unqualified query on
> this view can run in 1/60th the time even with indexes present - the
> sequential scans are efficient if the proportion of the tables being
> returned is high.
> 
> David J.

Read these.

https://wiki.postgresql.org/wiki/Using_EXPLAIN
https://wiki.postgresql.org/wiki/SlowQueryQuestions

And note: while the FK thing is likely not relevant in this situation,
defining a constraint does not cause an index to be created.  Depending on
your usage patterns, defining those indexes can be helpful.
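
In other words, if you decide the FK columns need indexes you have to create
them yourself; a sketch with hypothetical table/column names:

  -- the foreign key constraint alone does not create this index
  CREATE INDEX child_parent_id_idx ON child_table (parent_id);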

One last thought: not only are your row counts high but it seems like your
row sizes may also be large due to them containing binary content.  You
likely need to take a different approach to solving whatever unspecified
problem this query is intended to solve if you need sub-second performance.

That all said, the main area of improvement here is system memory
configuration, so, as noted in the links above, play with that and see what
happens.

David J.







Re: [PERFORM] Slow query

2014-03-06 Thread David Johnston
Bikram Kesari Naik wrote
> Hi,
> 
> I have a view which joins multiple tables to give me a result. It takes
> more than a minute to give me the result at the psql prompt when I select all
> the data from that view.
> The single CPU which is used to run this query is utilized 100%. Even if I
> fire a count(*) it takes 10 sec. I wanted to know if there is anything we
> can do to speed up this query to below 1 sec.

In all likelihood you need to index your foreign keys, and possibly other
fields, but as you haven't provided table and index definitions it is hard
to say for sure.

Depending on how many rows are hidden I'm not sure an unqualified query on
this view can run in 1/60th the time even with indexes present - the
sequential scans are efficient if the proportion of the tables being
returned is high.

David J.







[PERFORM] Re: Problem with slow query with WHERE conditions with OR clause on primary keys

2013-12-18 Thread David Johnston
kolsze...@gmail.com wrote
> Thanx for your answer
> 
> My example is trivial because i want to show strange (for me) postgres
> behavior with dealing with primary keys (extreme example), in real
> situation user put search condition e.g.  "Panas" and this generates query
> ...
> where gd.other_code like 'Panas%' OR g.code like 'Panas%'
> ..
> 
> both columns has very good indexes and selectivity for "like 'Panas%'" ...
> 
> I have experience from Oracle with this type of queries, and Oracle have
> no problem with it,
> executes select on index on other_code from gd and join g
> in next step executes select on index on code from g and join gd
> and this two results are connected in last step (like union)
> very fast on minimal cost
> 
> and in my opinion reading whole huge tables just for 10 rows in the result,
> when the where conditions are very good ... is strange

I suppose the equivalent query that you'd want would be:

SELECT ... FROM gd JOIN gd_data USING (id_gd)
WHERE id_gd IN (
    SELECT id_gd FROM gd WHERE ...
    UNION ALL  -- DISTINCT not required in this situation
    SELECT id_gd FROM gd_data WHERE ...
)  -- ignoring NULL implications

It does make sense conceptually...

David J.








[PERFORM] Re: Problem with slow query with WHERE conditions with OR clause on primary keys

2013-12-10 Thread David Johnston
Krzysztof Olszewski wrote
> Hi,
> 
> my sql is very simple,
> returns one row,
> where conditions are assigned to primary keys
> 
> 
> select g.gd_index, gd.full_name
> from gd g join gd_data gd on (g.id_gd = gd.id_gd)
> where gd.id_gd_data =  OR g.id_gd = ;
> 
> 
> but generates a "crazy" plan with a Merge Join on a big number of rows (both
> tables contain 50 rows)
> because the index scans ignore the conditions; the conditions are processed
> after the index scans, in the Merge Join
> 
> Merge Join  (cost=0.00..46399.80 rows=2 width=115) (actual time=3.881..644.409 rows=1 loops=1)
>   Merge Cond: (g.id_gd = gd.id_gd)
>   Join Filter: ((gd.id_gd_data = ) OR (g.id_gd = ))
>   ->  Index Scan using pk_gd on gd g  (cost=0.00..14117.79 rows=51 width=40) (actual time=0.019..146.521 rows=51 loops=1)
>   ->  Index Scan using fki_gd on gd_data gd  (cost=0.00..22282.04 rows=51 width=99) (actual time=0.016..157.384 rows=51 loops=1)
> Total runtime: 644.460 ms
> 
> 
> model is very simple
> 
> 
> CREATE TABLE gd (
>    id_gd bigint NOT NULL,
>    gd_index character varying(60) NOT NULL,
>    notes text,
>    notes_exists integer NOT NULL DEFAULT 0,
>    CONSTRAINT pk_gd PRIMARY KEY (id_gd)
> )
>
> CREATE TABLE gd_data (
>    id_gd_data bigint NOT NULL,
>    id_gd bigint NOT NULL,
>    short_name character varying(120) NOT NULL,
>    full_name character varying(512) NOT NULL,
>    notes text,
>    notes_exists integer NOT NULL DEFAULT 0,
>    CONSTRAINT pk_gd_data PRIMARY KEY (id_gd_data),
>    CONSTRAINT fk_gd FOREIGN KEY (id_gd)
>        REFERENCES gd (id_gd) MATCH SIMPLE
>        ON UPDATE NO ACTION ON DELETE NO ACTION
> )
>
> CREATE INDEX fki_gd
>    ON gd_data
>    USING btree
>    (id_gd);
> 
> 
> my configuration from (select * from pg_settings):
> 
> "server_version";"9.1.10"
> 
> Thank you for your help.
>   
> 
> Kris Olszewski

It cannot do any better since it cannot pre-filter either table using the
where condition without risking removing rows that would meet the other
table's condition post-join.

The query you are executing makes no sense to me: I don't understand why you
would ever filter on gd.id_gd_data given the model you are showing.

I believe your understanding of your model - or the model itself - is flawed,
but as you have only provided code it is impossible to pinpoint where
exactly the disconnect resides.  You can either fix the model or the query -
the latter by implementing sub-selects with where clauses manually - which
then encodes an assumption about your data that the current query cannot
make.

Your model implies that a single gd record can have multiple gd_data records
associated with it.

David J.







Re: [PERFORM] One huge db vs many small dbs

2013-12-05 Thread David Johnston
maxabbr wrote
> Hello,
> 
> We are starting a new project to deploy a solution in the cloud with the
> possibility of it being used by 2,000+ clients. Each of these clients will use
> several tables to store their information (our model has about 500+ tables,
> but there are fewer than 100 core tables with heavy use). Also, the projected
> amount of information per client could range from small (a few hundred
> tuples/MB) to huge (a few million tuples/GB).
>
> One of the many questions we have is about the performance of the db if we
> work with only one (using a ClientID to separate the clients' info) or
> thousands of separate dbs. The management of the dbs is not a huge concern
> as we have an automated tool.
>
> At Google there are lots of cases about this subject, but none has a
> scenario that matches the one I presented above, so I would like to
> know if anyone here has a similar situation or knowledge and could
> share some thoughts.
> 
> 
> Thanks
> 
> Max

My untested thought here is a hybrid approach.  Allow any one database to
contain any number of stores on a common schema with a controlling clientId
column, but allow for multiple databases.  Furthermore, any non-client
shared data could belong to a separate reference database, with the
possibility of caching said data in each of the client databases where
applicable.

Though, until your needs dictate that level of complexity, you can have just
one database and schema set for all clients.

While row-level security would make this more tenable, generally this model
works best if all client access is made via middleware.  You mitigate the risk
by using separate databases for any clients with a higher risk profile
(i.e., larger datasets, direct access to the DB, etc.).

Adding in clientId overhead will degrade performance somewhat but increase
your flexibility considerably.  That is often a worthwhile trade-off to make
even if you decided to create separate schemas/databases.
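
As a rough sketch of the shared-schema variant (table, column, and policy
names are hypothetical, and the row-level-security statements assume a
PostgreSQL version that supports CREATE POLICY):

  CREATE TABLE orders (
      client_id  bigint NOT NULL,
      order_id   bigint NOT NULL,
      payload    text,
      PRIMARY KEY (client_id, order_id)
  );

  -- optional: enforce tenant isolation in the database itself
  ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
  CREATE POLICY client_isolation ON orders
      USING (client_id = current_setting('app.client_id')::bigint);

The policy assumes the middleware sets a custom app.client_id setting for each
session or transaction.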

David J.







[PERFORM] Re: Adding foreign key constraint holds exclusive lock for too long (on production database)

2013-10-29 Thread David Johnston
Ben Hoyt wrote
> * http://www.postgresql.org/message-id/51A11C97.90209@ --
> indicates that the db ignores the index when adding constraints

As noted in the referenced thread (and never contradicted), the current
algorithm is "for each record, does the value in the FK column exist in the
PK table?", not "do all of the values currently found in the FK table exist
in the PK table?".  The latter question is seemingly much faster to answer
during a bulk ALTER TABLE (if table statistics imply a small-ish number of
bins and an index is present on the column), but the former is the more
common question - when simply adding a single row.

You need to figure out some way to avoid continually evaluating the FK
constraint on all 20M rows - most of which were previously
confirmed.  Most commonly people simply perform an incremental update of a
live table and insert/update/delete only the records that are changing
instead of replacing an entire table with a new one.  If you are generally
happy with your current procedure I would probably continue on with your
"live" and "content" schemas but move this table into a "bulk_content"
schema and within that have a "live" table and a "staging" table.  You can
drop/replace the staging table from your office database and then write a
routine to incrementally update the live table.  The FK references in live
and content would then persistently reference the "live" table and only the
subset of changes introduced would need to be checked.
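
A hedged sketch of the incremental-update idea, with hypothetical
bulk_content.live and bulk_content.staging tables keyed on id - only rows that
actually change are ever re-checked against the FK:

  BEGIN;
  -- update rows whose content changed
  UPDATE bulk_content.live l
     SET payload = s.payload
    FROM bulk_content.staging s
   WHERE l.id = s.id AND l.payload IS DISTINCT FROM s.payload;

  -- insert brand-new rows
  INSERT INTO bulk_content.live (id, payload)
  SELECT s.id, s.payload
    FROM bulk_content.staging s
   WHERE NOT EXISTS (SELECT 1 FROM bulk_content.live l WHERE l.id = s.id);

  -- remove rows that disappeared from the source (watch for rows still referenced)
  DELETE FROM bulk_content.live l
   WHERE NOT EXISTS (SELECT 1 FROM bulk_content.staging s WHERE s.id = l.id);
  COMMIT;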

David J.






Re: [PERFORM] 57 minute SELECT

2013-10-02 Thread David Johnston
Samuel Stearns-2 wrote
> EXPLAIN:
> 
> QUERY PLAN
>  
> ---
>  Hash Join  (cost=408.53..1962721.39 rows=98068 width=126) (actual
> time=30121.265..3419306.752 rows=1929714 loops=1)

You are selecting and returning 2 million records...how fast do you want
this to run?  For some reason I read 57 seconds initially - I guess 57
minutes is a bit much...but the most obvious solution is RAM.

You might want to include BUFFERS output in the EXPLAIN as well, but:

I doubt the contents of your result fit into server memory, so your
disk is involved, which will severely slow down processing.

Hopefully someone more knowledgeable and experienced will chime in to help
you.

David J.






Re: [PERFORM] 57 minute SELECT

2013-10-02 Thread David Johnston
Samuel Stearns-2 wrote
> Total RAM - 1G
> 
> 
> 1.   Explain on SELECT.

So either this is a typo (1 GB of RAM) or your query is likely ending up I/O
bound.

You should probably provide EXPLAIN and EXPLAIN (ANALYZE) output since even
with the schema it is impossible for someone to see what the planner is
proposing for a multiple-million record source table that is going to be
empty if all someone does is create the schema.
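
Something along these lines, run against the actual statement (ANALYZE and
BUFFERS are standard EXPLAIN options; the query itself is yours to fill in):

  EXPLAIN (ANALYZE, BUFFERS) SELECT ... ;   -- substitute the real 57-minute query for the ...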

For my money it is also helpful to actually write some prose describing what
you are providing and seeing, and not just toss some settings and schema out
there.

http://wiki.postgresql.org/wiki/SlowQueryQuestions

David J.






Re: [PERFORM] ORDER BY, LIMIT and indexes

2013-08-05 Thread David Johnston
Sergey Konoplev-2 wrote
> As an alternative solution for pagination (OFFSET) problem you might
> also use the "prev/next" technique, like
> 
> SELECT * FROM table
> WHERE id > :current_last_id
> ORDER BY id LIMIT 10
> 
> for "next", and
> 
> SELECT * FROM (
> SELECT * FROM table
> WHERE id < :current_first_id
> ORDER BY id DESC
> LIMIT 10
> ) AS sq ORDER BY id
> 
> for "prev". It will be very fast.

Even being fairly experienced at SQL generally, because I haven't explored
pagination that much my awareness of the OFFSET issue led me to conclude bad
things.  Thank you for taking the time to provide a brief moment of
enlightenment about something you likely take for granted by now.

I'm curious how much slower/faster these queries would run if you added:

SELECT *, first_value(id) OVER (...), last_value(id) OVER (...)
-- note: the window specifications need to overcome the "ORDER BY" limitation
-- noted in the documentation

to the query.  Using the window functions you know, at each record, what the
first and last ids are for its window.  Applicability would be
application/need specific, but it would avoid having to calculate/maintain
these two values in a separate part of the application.
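
Concretely, a hedged sketch of the "next page" query carrying its own window
boundaries (table and placeholder names follow the example above; the explicit
frame clause is what works around the default-frame behavior under ORDER BY):

  SELECT t.*,
         first_value(id) OVER w AS page_first_id,
         last_value(id)  OVER w AS page_last_id
  FROM (
      SELECT * FROM some_table            -- hypothetical table
      WHERE id > :current_last_id
      ORDER BY id
      LIMIT 10
  ) t
  WINDOW w AS (ORDER BY id
               ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING);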

David J.







[PERFORM] Re: Seq Scan vs Index on Identical Tables in Two Different Databases

2013-07-17 Thread David Johnston
Ellen Rothman wrote
> I have the same table definition in two different databases on the same
> computer. 

You really should prove this to us by running schema commands on the table
and providing results.

Also, version information has not been provided, and you do not state whether
the databases are the same as well as the tables.  And do those tables have
identical data or just identical structure?


> When I explain a simple query in both of them, one database uses a
> sequence scan and the other uses an index scan.

Corrupt index maybe?  Or back to the first point maybe there isn't one.


> If I try to run the Seq Scan version without the where clause restricting
> the value of uniqueid, it uses all of the memory on my computer and never
> completes.

How are you running this and how are you defining "never completes"?

Can you run this but with a LIMIT clause so your client (and the database)
does not try to display 3 million rows of data?


> How can I get the Seq Scan version to use an index scan?

Re-Index (or drop/create even)

Also, you should always try to provide actual queries and not just explains. 
Since you are getting "Aggregate" nodes you obviously aren't running a
simple "SELECT * FROM publcoop_ext [WHERE ...]".

Ideally you can also provide a self-contained test case, though your
scenario seems simple enough that likely either:

1) You didn't run analyze
2) Your table and/or index is corrupt
3) You do not actually have an index on the table even though you claim they
are the same
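
If it turns out to be (1) or (2), the fix is cheap enough to simply try (table
name taken from your plan; adjust as needed, and note REINDEX takes a lock):

  ANALYZE publcoop_ext;          -- refresh planner statistics
  REINDEX TABLE publcoop_ext;    -- rebuild possibly corrupt indexes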









Re: [PERFORM] My changes in the postgresql.conf does not work

2013-07-03 Thread David Johnston
David Carpio wrote
> I have made some changes in my postgresql.conf; well, I made two changes
> in this file. The first time, my queries improved their execution
> time considerably, but after the second change my queries have not
> improved; on the contrary, they have gone back to being slow or, at best,
> they have not changed from their previous improvement.
> 
> These are my changes:
> 
> + shared_buffers = 4GB.
> + bgwriter_lru_maxpages = 250.
> + synchronous_commit = off.
> + effective_io_concurrency = 3.
> + checkpoint_segments = 64.
> + checkpoint_timeout = 45min
> + logging_collector = on
> + log_min_duration_statement = 500
> + log_temp_files = 0.
> 
> my max connections are 150
> 
> Please, what would be my error?
> 
> Thank you for the tips,
> 
> David Carpio

It might increase the likelihood of a meaningful response if you include:

1) The default for each of the parameters
2) The value you used for each parameter in the situation where performance
improved

Very few people have memorized the first, and it would be interesting to have the
second for reference.

Also,

Performance questions are hardware/software specific.  What are you running
and how are you testing?

You should also do some reading about performance question posting and
performance in general:

https://wiki.postgresql.org/wiki/SlowQueryQuestions

One question is whether you really want to make the default
"synchronous_commit" setting "off".  Can you even measure the actual
difference that turning this off makes in your specific use-case, or are you
just throwing stuff against the wall that says "this can improve performance"
and hoping things work out for the best?  Since it can be turned on/off on a
per-transaction basis, you should generally try to only have it off in areas
that are meaningful and where you've acknowledged the corresponding
additional risk specific to that area.
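
For example, rather than flipping it off globally in postgresql.conf, a sketch
of scoping it to a single transaction:

  BEGIN;
  SET LOCAL synchronous_commit TO off;  -- applies to this transaction only
  -- ... the bulk work you are willing to lose if the server crashes here ...
  COMMIT;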

David J.








Re: [PERFORM] Not same plan between static and prepared query

2013-06-09 Thread David Johnston
Yuri Levinsky wrote
>> We have two equals queries where the sole difference is in the limit.
>> - The first is hard coded with limit 500.
>> - The second is prepared with limit $1 ($1 is bound to 500).
> 
> 
>> PostgreSQL give us two different plans with a huge execution time for 
>> the
> prepared query:
> 
> It can generate different plan for prepared query, because optimizer uses
> default selectivity in case of bound parameters (in your case limit $1).
> 
> 
>> We met the same behaviour with both :
>> - PostgreSQL 8.4.8 on Windows 2008 (Prod)
>> - PostgreSQL 8.4.8 and 8.4.17 on Windows 7 (Dev)

So the planner knows it needs a limit in both cases, yet in the second
situation it has no idea what the limit value will be.  For a sufficiently
large value of LIMIT it will conclude that a sequential scan will be optimal,
and so that is what the plan uses.  However, knowing the limit is only going
to be 500, it is able to conclude that an index scan will work better.
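
You can see the effect yourself with something like this sketch (table and
column names are hypothetical):

  PREPARE q (int) AS SELECT * FROM big_table ORDER BY created_at LIMIT $1;
  EXPLAIN EXECUTE q(500);  -- on 8.4 this shows the generic plan chosen without knowing $1
  EXPLAIN SELECT * FROM big_table ORDER BY created_at LIMIT 500;  -- the hard-coded version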


> From PostgreSQL 9.2, it generates plan for prepared query during execution
> (Execute command) as well.
> So I think you will not face this problem in PostgreSQL 9.2 and above.

See:

http://www.postgresql.org/docs/9.2/interactive/release-9-2.html

Section E.5.3.1.3 (First Bullet)

Someone more knowledgeable than myself will need to comment on how the
performance impact was overcome, but my guess is that update statements
likely avoid this behavior if the where clauses are equality conditions,
since indexes (if available) are going to be the most efficient plan
regardless of the specific values.  It's when, as in cases like this, the
planner knows the specific value of LIMIT will matter greatly that it is
going to need to use a run-time plan.  Whether during the PREPARE phase the
planner tags the resultant plan with some kind of "allow runtime plan" flag
I do not know, so maybe the first few executions will always use
run-time plans and only after N executes does the cached plan come into
effect.

It's probably worth a search and read of the mailing list archives, but I
cannot do so at this moment.

David J.



