[GENERAL] How to generate our own apt packages for Ubuntu?

2016-02-23 Thread David Grelaud
Hello,

We would like to deploy a modified version of PostgreSQL on our servers.
It would be easier if we could build our own apt package for Ubuntu 14.04.

We've found this project:
http://git.postgresql.org/gitweb/?p=pgapt.git;a=summary
Is it the right tool? How do we use it (without Jenkins if possible)?
Is there a better solution?
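
For context, the only route we currently know is the classic Debian
source-package rebuild, roughly like this (package names assume the stock
postgresql-9.3 of Ubuntu 14.04, and deb-src lines enabled in sources.list):

    sudo apt-get build-dep postgresql-9.3
    apt-get source postgresql-9.3
    cd postgresql-9.3-*
    # apply our patches here, then bump the package version
    # so apt prefers our build:
    dch --local ideolys "Apply our custom patches"
    dpkg-buildpackage -us -uc

Is pgapt a better way to do the same thing?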

Thank you,

*David Grelaud*


Re: [GENERAL] PostgreSQL vs Firebird SQL

2016-02-09 Thread David Grelaud
I confirm what Josh Berkus said. Performance will not be a problem as long
as you manage your database well (good configuration, good hardware, good
queries, good data organisation...), and I suspect the same is true of a
lot of other competitors.

We have never used Firebird, but we have used PostgreSQL intensively since
2011 on thirty independent servers.
All our servers run an application which has 150 tables and more than 500
hand-written queries, of which more than 30% are really complex (more than
2000 lines of code), using a lot of features (json, hstore, full text
search, window functions, custom types, custom window functions, all kinds
of indexes, recursive queries, stored procedures, triggers, locks, a lot of
CTEs, range types, array types...).
One of our servers, which runs Ubuntu Server on a virtual machine (4 cores,
16 GB RAM), has more than 100 million rows with more or less the same daily
volume of inserts, updates and deletes as you describe.
Happily, we have never experienced data corruption so far (fingers
crossed).

*David Grelaud*

2016-02-10 8:06 GMT+01:00 Josh Berkus <j...@agliodbs.com>:

> On 02/10/2016 05:10 AM, ioan ghip wrote:
>
>> I have a Firebird SQL database running on one of my servers which has
>> about 50k inserts, about 100k updates and about 30k deletes every day.
>> There are about 4 million records in 24 tables. I have a bunch of stored
>> procedures, triggers, events and views that I'm using.
>> Firebird works fairly well, but from time to time the database gets
>> corrupted and I have never figured out (after many years of running it)
>> what the reason is. When this happens I run "gfix -mend -full -ignore",
>> backup and restore the db, and everything is fine until the next problem
>> in a week, or a month.
>>
>> I never used PostgreSQL. Yesterday I installed it on my development
>> machine and after a few tests I saw that it's fairly easy to use.
>>
>> Does anyone have experience with both, Firebird and PostgreSQL? Is
>> PostgreSQL way better performing than Firebird? Is it worth the effort
>> moving away from Firebird? Would I gain stability and increased
>> performance?
>>
>
> Well, performance in PostgreSQL is largely dependent on your hardware. The
> numbers you're talking about seem pretty small to me, though; I can do 1000
> inserts per *second* on a medium-sized AWS instance.  So I don't think
> performance will be your main concern.
>
> I'm sorry to hear about your data corruption issues on Firebird.  That's
> disappointing, especially since Firebird was one of the champion early open
> source databases.  Proof against database corruption is a major part of
> PostgreSQL.  I suggest turning on data checksums when you create your
> database (this is not the default option) just in case the corruption issue
> is actually your hardware.
>
> PostgreSQL is a *server* database, though, so managing it is going to be
> fairly different from Firebird, which is primarily a desktop database. I
> suggest looking into pgAdmin4 to help with that.
>
> --
> --
> Josh Berkus
> Red Hat OSAS
> (any opinions are my own)
>
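
(One note on Josh's checksum suggestion: data checksums can only be turned
on when the cluster is initialized, not afterwards, so the choice has to be
made up front. For example, with a placeholder data directory:

    initdb --data-checksums -D /var/lib/postgresql/9.4/main

On Debian/Ubuntu, pg_createcluster can pass the same option through to
initdb.)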


Re: [GENERAL] Why PG uses nested-loop join when no indexes are available?

2016-01-20 Thread David Grelaud
Thank you both for your help.

We will test your patch, but we need to understand the code a bit more in
order to follow your discussion.
Actually, your patch helps us find where to start in the code ;).

> The planner is never going to get it right 100% of the time.

Yes, I agree.
In production environments, even if PostgreSQL chooses such a bad plan only
1% of the time, that is enough to make clients angry. My goal is to
eradicate the risk of choosing a nested loop in certain cases, where it
freezes PostgreSQL for many minutes while a hash join or something else
completes in only 2 seconds. The performance difference is huge.
I mean, even if the plan is not the best one 100% of the time, the planner
should at least choose a "risk-free" plan, without these "bad" nested loops.
That may be easier said than done, but we want to try.
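
For now, the stopgap we use when we spot one of these queries is the
session-level switch, e.g. (table names are made up):

    BEGIN;
    SET LOCAL enable_nestloop = off;  -- only affects this transaction
    SELECT *
    FROM report_lines r
    JOIN report_totals t ON t.report_id = r.report_id;
    COMMIT;

but of course it also disables the nested loops we do want, as discussed
below.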

Regards,

*David Grelaud*

2016-01-15 2:16 GMT+01:00 David Rowley <david.row...@2ndquadrant.com>:

> On 15 January 2016 at 04:00, Tom Lane <t...@sss.pgh.pa.us> wrote:
>
>> David Rowley <david.row...@2ndquadrant.com> writes:
>> > Perhaps separating out enable_nestloop so that it only disables
>> > non-parameterised nested loops, and add another GUC for parameterised
>> > nested loops would be a good thing to do. Likely setting enable_nestloop
>> > to off in production would be a slightly easier decision to make, if
>> > that was the case.
>> > It looks pretty simple to do this, so I hacked it up, and attached it
>> > here. There's no doc changes and I'm not that interested in fighting
>> > for this change, it's more just an idea for consideration.
>>
>> I'm not terribly excited by this idea either.  If making such a change
>> actually makes things better for someone consistently, I'd argue that
>> the problem is a mistaken cost estimate elsewhere, and we'd be better off
>> to find and fix the real problem.  (There have already been discussions
>> of only believing single-row rowcount estimates when they're provably
>> true, which might help if we can figure out how to do it cheaply enough.)
>>
>
> Actually, it's not very hard to hit a bad underestimate at all. All you
> need is a join on two columns which are correlated. Since PostgreSQL
> multiplies the estimated selectivities, the row count is going to come out
> too low. This also tricks the planner into thinking that this is a good
> join to perform early, since (it thinks that) it does not produce many rows
> at all. You only need 1 more join to occur after that to mistakenly choose
> a nested loop join and hit the issue.
>
> FWIW, TPC-H Q9 has this exact trip hazard with the partsupp table, which
> is the very reason this patch was born:
> https://commitfest.postgresql.org/7/210/
>
> I also think that the attitude that we can *always* fix the costs and
> estimates is not the right one. The planner is never going to get it right
> 100% of the time. If we ever think we can build such a planner then someone
> needs to come along and direct us back into the real world.
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
> <http://www.2ndquadrant.com/>
>  PostgreSQL Development, 24x7 Support, Training & Services
>
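
(A self-contained way to reproduce the underestimate David describes, with
made-up data where the two join columns are perfectly correlated:

    CREATE TABLE corr_demo AS
      SELECT i AS a, i AS b FROM generate_series(1, 100000) i;
    ANALYZE corr_demo;
    EXPLAIN SELECT *
      FROM corr_demo t1
      JOIN corr_demo t2 ON t1.a = t2.a AND t1.b = t2.b;

Because the selectivities of the two join clauses are multiplied, the plan
shows an estimate of about 1 row where the join actually returns 100000.)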


[GENERAL] Why PG uses nested-loop join when no indexes are available?

2016-01-13 Thread David Grelaud
Hi,

Statistics are not propagated when Common Table Expressions (CTEs) are used.
The estimated cardinality of a CTE is equal to 1 most of the time, so every
join with previously computed CTEs is done with the nested-loop algorithm.
This seems to be a really risky choice, even without CTEs, according to
this paper and our own experiments:

"How good are query optimizers, really?." Proceedings of the VLDB Endowment
9.3 (2015): 204-215. (Paragraph 4.1) Leis, Viktor, et al.
http://www.vldb.org/pvldb/vol9/p204-leis.pdf
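
The shape of query where we observe this every day looks like the following
(table names are made up; assume both CTEs return many rows):

    WITH recent_orders AS (
      SELECT * FROM orders
      WHERE created_at > now() - interval '1 day'
    ),
    all_lines AS (
      SELECT * FROM order_lines
    )
    SELECT *
    FROM recent_orders o
    JOIN all_lines l ON l.order_id = o.id;

EXPLAIN on this kind of statement regularly shows a nested loop between the
two CTE scans.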

There are interesting discussions on the web about CTEs and bad
performance:

-
http://www.postgresql.org/message-id/flat/col116-w25f8931477407ed7689d69a3...@phx.gbl#col116-w25f8931477407ed7689d69a3...@phx.gbl
- http://blog.2ndquadrant.com/postgresql-ctes-are-optimization-fences/
- ...

So when the problem happens (cost underestimation -> nested loop -> many
rows -> terrible performance), we currently have these solutions:

- refactor the query using Subquery Expressions instead of CTEs (see the
sketch after this list), but the query becomes really ugly to read
(increasing maintenance cost), and we may lose some other execution-plan
optimisations provided by CTEs
- refactor the query using a temporary table, but then it becomes impossible
to use a single-query prepared statement
- disable nested loops, but then PostgreSQL no longer uses indexes when they
are available
- use an extension that enables Oracle-style hints (
https://fr.osdn.jp/projects/pghintplan/), but then the system becomes blind
(not data-dependent; potential future algorithms would never be used, ...)
- is there another existing solution I'm not aware of?
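
To illustrate the first option, here is the kind of rewrite we mean, on a
made-up query:

    -- CTE version (an optimization fence in current PostgreSQL):
    WITH open_orders AS (
      SELECT * FROM orders WHERE status = 'open'
    )
    SELECT c.name, o.total
    FROM customers c
    JOIN open_orders o ON o.customer_id = c.id;

    -- Subquery version (the planner can flatten it and keep statistics):
    SELECT c.name, o.total
    FROM customers c
    JOIN (SELECT * FROM orders WHERE status = 'open') o
      ON o.customer_id = c.id;

It works, but on our 2000-line queries this rewrite quickly becomes
unreadable.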

I'm sure PostgreSQL could provide a better solution to solve this problem.


1) Would it be easy to compute and propagate statistics for CTEs, as is
done for subqueries?

The argument usually given by the PostgreSQL community is:
"CTEs were created to let the developer control the execution plan, so
statistics computation is effectively disabled."
OK, the developer can roughly control the execution plan, but at the same
time Postgres becomes "stupid" inside each CTE and always chooses the "same"
join algorithm (the riskiest) to join previously computed CTEs.
It is like giving somebody the power to fly while removing his eyes ;).

Drawbacks: even if statistics are computed and propagated across CTEs, when
queries are really complex the cost estimator may still fail to compute the
cardinality, and the nested-loop join problem will still happen.


2) Would it be possible to let the developer inject cardinality hints in
the query?

As suggested by this paper:
Chaudhuri, Surajit. "Query optimizers: time to rethink the contract?"
Proceedings of the 2009 ACM SIGMOD International Conference on Management of
Data. ACM, 2009. p. 961-968.
http://courses.cs.washington.edu/courses/csep544/10au/readings/p961-chaudhuri.pdf

The SQL developer could, for example, inject a cardinality estimate in a
comment such as "my_cte:10". The developer would be responsible for keeping
this cardinality up to date using his own metrics and tools.
Thus the cardinality remains data-dependent (not constant ad vitam aeternam)
and the planner is able to choose the best join algorithm according to all
the other parameters (system I/O, ...).
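
For example (purely hypothetical syntax; nothing like this exists today):

    WITH my_cte AS (
      SELECT customer_id, sum(total) AS amount  -- hint: cardinality(my_cte)=10
      FROM orders
      GROUP BY customer_id
    )
    SELECT *
    FROM my_cte
    JOIN customers c ON c.id = my_cte.customer_id;

The planner would use 10 as the estimated row count of my_cte instead of
deriving its own estimate.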


3) Always avoid nested-loop join when no indexes are available?

Tom Lane said, "There might be some cases where this would help, but there
would be many more where it would be useless or counterproductive."
Who is right: Tom Lane, or Viktor Leis's paper above?

We tried disabling nested loops entirely in a production environment and we
observed an overall improvement in all queries where indexes are not useful
or not available (CTEs), which confirms the paper.
In fact, one of our production environments is still running with
"enable_nestloop = off" because the benefits far outweigh the drawbacks, as
long as some tables remain relatively small (indexes not used).
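
Concretely, that environment simply carries a database-level override (the
database name is a placeholder):

    ALTER DATABASE proddb SET enable_nestloop = off;

New sessions on that database then pick up the setting automatically.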


4) Do runtime optimizations?

According to research papers, this will be the next challenge. But I suspect
it would be difficult to implement in a relatively short time frame?



What is the purpose of this message:

We would like to find a "simple" long-term solution to this cost
underestimation problem, which generates hazardous performance regressions
in our production environments.

We would like to hear critiques or other solutions from PostgreSQL experts.

We would like to help develop and test the solution.


Thank you very much!

Regards,
---
David Grelaud,
Ideolys.


Re: [GENERAL] Why PG uses nested-loop join when no indexes are available?

2016-01-13 Thread David Grelaud
Thank you for your fast response!

I'm sorry, my vocabulary may not be correct and my "French approach" to
explaining my problem is not efficient ;).

But the underestimation problem in complex queries is still there, isn't it?
... and it generates potentially "dangerous" plans (nested loops).

We thought at the beginning that we were alone, but it seems to be a problem
for most database systems.
What do you think about Section 4.1 of this paper?
http://www.vldb.org/pvldb/vol9/p204-leis.pdf

Regards,
-------
David Grelaud,
Ideolys.



2016-01-13 16:02 GMT+01:00 Tom Lane <t...@sss.pgh.pa.us>:

> David Grelaud <dgrel...@ideolys.com> writes:
> > Statistics are not propagated when Common Table Expressions (CTE) are
> used.
> > The cardinality of a CTE is equal to 1 most of the time
>
> Really?
>
> The rest of this seems to be proceeding from completely false assumptions.
>
> regards, tom lane
>