RE: PostgreSQL suitable?

2017-12-19 Thread Kellner Thiemo
Hi

Thank you all for your thoughts and for the light you have shed on this 
subject. I am very pleased to have received such engaged feedback, and also 
to learn that PostgreSQL seems to be so capable. I am quite sad that we most 
probably will not set up our DWH with it as the RDBMS; that is, I fear, 
largely an in-house political decision.

Kind regards

Thiemo



Re: PostgreSQL suitable?

2017-12-19 Thread Vincenzo Romano
Hi.

2017-12-19 17:32 GMT+01:00 Stephen Frost:
> Greetings,
>
> * Vincenzo Romano (vincenzo.rom...@notorand.it) wrote:
>> 2017-12-19 16:48 GMT+01:00 Stephen Frost:
>> > * Vincenzo Romano (vincenzo.rom...@notorand.it) wrote:
>> >> Sorry, my bad: I confused V10 with v11.
>> >> But according to a discussion with Bruce Momjian, table partitioning
>> >> in V10 is little more than syntactic sugar around old-fashioned table
>> >> partitioning.
>> >
>> > Well, it's a bit more than that since there's tuple-routing, but you're
>> > right that the partition elimination is the same as it was in earlier
>> > versions and based on constraint exclusion.  That said, as noted in the
>> > email you replied to, reasonable numbers of partitions aren't too bad
>> > even with the planning cost; it's when you have many thousands of
>> > partitions that you get into cases where planning time for queries is
>> > really bad.
>>
>> When you have to handle a 100TB table, the number of partitions
>> shouldn't be 10 or 12
>> as seen in most examples and tests.
>
> Probably not, but it would depend on the data and what you're doing with
> it.  While there are some general rules-of-thumb, there's no one right
> answer when it comes to the size of individual partitions.

I think partitioning is done in order to get smaller tables, or to
balance table size against table count.
At the moment you have to keep a limit on the number of sub-tables in
mind, which makes partitioning less effective.
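As a concrete illustration of the feature under discussion, here is a
minimal sketch of PG10's declarative range partitioning; the table and
column names are hypothetical, not taken from the thread:

    -- Hypothetical parent table, partitioned by month (PG10 syntax).
    CREATE TABLE measurements (
        taken_at  timestamptz NOT NULL,
        device_id bigint      NOT NULL,
        reading   numeric
    ) PARTITION BY RANGE (taken_at);

    -- One sub-table per month; PG10 tuple routing sends INSERTs into
    -- the matching partition automatically.
    CREATE TABLE measurements_2017_12 PARTITION OF measurements
        FOR VALUES FROM ('2017-12-01') TO ('2018-01-01');
    CREATE TABLE measurements_2018_01 PARTITION OF measurements
        FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');

    -- Partition elimination in PG10 still uses constraint exclusion,
    -- checked partition by partition, which is why planning time grows
    -- with the partition count.
    EXPLAIN SELECT * FROM measurements
     WHERE taken_at >= '2017-12-19' AND taken_at < '2017-12-20';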

>> This is the same type of issues you hit with partial indexes (this is
>> why I mentioned them earlier).
>> Sub-table (and partial index) selection algorithm should be
>> logarithmic or sub-linear.
>
> Sure, and work is being done to improve PG in that exact area, but that
> doesn't mean it can't handle workloads like this today, but you have to
> be aware of the costs associated with today's partitions.
>
>> As long as it'll be linear, you'll hear about "reasonable number of 
>> partitions".
>> One thousand partitions for a 100TB table would make "manageable"
>> 100GB sub-tables.
>
> Yup, and 100G tables are certainly large and a bit awkward but they can
> be managed.

Of course. But I would bet they are still considered "very large tables".

>> I could easily be wrong, but this is an area where PG needs improvement.
>
> Of course, and work is being done to improve it.  What's important is
> knowing that there's a cost to having more partitions when querying
> through the parent when you get to a point where you have thousands of
> partitions.  That cost may be perfectly fine in some use-cases and in
> others it might not be, and instead you'd likely have to build logic
> into the application layer to address it.  That's not ideal, which is
> why there's work being done to improve PG, but it's not necessarily a
> big deal either.  Dynamic SQL is another approach.
>
>> One could maybe think about multi-level partitioning, though.
>
> Not sure exactly what you're referring to here, but a simple multi-level
> partitioning setup with PG wouldn't actually change the cost for
> partition elimination today.

It'd be a tree of tables, giving logarithmic selection.
This is why we love tree data structures.
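For illustration, such a tree can already be declared in PG10 by
partitioning a partition (names below are hypothetical), although, as
noted above, this alone does not make elimination logarithmic:

    -- Hypothetical two-level tree: years at the top, months below.
    CREATE TABLE events (
        ts      timestamptz NOT NULL,
        payload text
    ) PARTITION BY RANGE (ts);

    -- A partition that is itself partitioned.
    CREATE TABLE events_2017 PARTITION OF events
        FOR VALUES FROM ('2017-01-01') TO ('2018-01-01')
        PARTITION BY RANGE (ts);

    CREATE TABLE events_2017_12 PARTITION OF events_2017
        FOR VALUES FROM ('2017-12-01') TO ('2018-01-01');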

>> > Also as noted on this thread, PG could handle this data volume, but to
>> > be efficient there would be work to be done in normalization,
>> > aggregation, and analyzing the system to ensure you're storing and
>> > querying on the data efficiently.
>>
>> Normalization will grow the number of tables (and later the joins), and
>> you'll very likely end up with at least one table with a
>> "gazillion" rows.
>
> Naturally, but that "gazillion" rows table would be much smaller for
> having the data normalized; if you don't normalize it then the gazillion
> row table is a huge amount of duplicated data, making the entire system
> much larger than necessary.

Partitioning is done for reasons other than de-duplicating data.
The number of rows to be analyzed would still be the same, and the
indexes over those columns would still be rather large.

>> I fear normalization, provided it's really needed, would provide little help.
>
> I seriously doubt that's the case.  Normalization might reduce that
> 100TB down to 10's of TB instead, or perhaps even smaller.

A 10x duplication factor seems very bad to me.
If that's the case, then normalization would provide some
improvement, though it costs resources to carry out.
Storage nowadays isn't a big issue; querying is.
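To make the two positions concrete, here is a minimal, hypothetical
sketch of the normalization in question. The row count of the big table
stays the same, but every row (and the indexes over it) becomes much
narrower, because the repeated text moves into a small dimension table:

    -- De-normalized: descriptive text is repeated on every row.
    CREATE TABLE readings_flat (
        taken_at    timestamptz NOT NULL,
        device_name text        NOT NULL,  -- duplicated on every row
        device_site text        NOT NULL,  -- duplicated on every row
        reading     numeric
    );

    -- Normalized: the gazillion-row table keeps an 8-byte key and the
    -- duplication is confined to a small dimension table.
    CREATE TABLE devices (
        device_id   bigserial PRIMARY KEY,
        device_name text NOT NULL,
        device_site text NOT NULL
    );
    CREATE TABLE readings (
        taken_at  timestamptz NOT NULL,
        device_id bigint      NOT NULL REFERENCES devices,
        reading   numeric
    );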

>> With tables that big I usually do "software partitioning".
>> I make the software aware of the partition schema so it can direct the
>> queries to the needed (sub-)tables.
>
> Yes, implementing partition elimination at the application level is
> certainly an approach that can work with PG.  Even then, ideally, you'd
> normalize the data so that the footprint is much smaller, making more
> efficient use of the storage and memory available.

Normalizing can save storage, but querying is still the real issue.
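The "software partitioning" described here, and the dynamic SQL that
Stephen mentions, can be sketched roughly as follows. This is an
illustrative PL/pgSQL routing function with hypothetical names (it
reuses the hypothetical readings table sketched above), not code from
either poster:

    -- Route a query straight to the monthly sub-table, bypassing the
    -- planner's linear walk over all partitions of the parent.
    CREATE OR REPLACE FUNCTION readings_for_day(day date)
    RETURNS SETOF readings
    LANGUAGE plpgsql AS $$
    BEGIN
        RETURN QUERY EXECUTE format(
            'SELECT * FROM %I
              WHERE taken_at >= $1 AND taken_at < $1 + interval ''1 day''',
            'readings_' || to_char(day, 'YYYY_MM'))  -- e.g. readings_2017_12
        USING day;
    END;
    $$;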

Re: PostgreSQL suitable?

2017-12-19 Thread Stephen Frost
Greetings,

* James Keener (j...@jimkeener.com) wrote:
> Would a storage block level incremental like zfs work?

This really depends on what you want out of your backups and on just
exactly how the ZFS filesystem is set up.  Remember that any backup of
PG that doesn't use PG's start/stop backup must be atomic across all
tablespaces, and even then it really just allows you to bring PG back
up as of the point of the snapshot.  I wouldn't recommend trying to
play WAL forward from that kind of backup.  If you do use
pg_start/stop_backup with ZFS snapshots, and make sure to track all of
the WAL that's generated between the start/stop backup and ensure it's
reliably stored, etc., then they can work, but it's not simple.
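For reference, a rough sketch of that procedure, assuming the exclusive
backup functions of this PG era (the snapshot step itself happens
outside SQL):

    -- 1. Tell PG a base backup is starting.
    SELECT pg_start_backup('zfs-snapshot-backup');

    -- 2. Outside SQL: take the atomic filesystem snapshot covering the
    --    data directory and ALL tablespaces (e.g. a zfs snapshot).

    -- 3. Tell PG the backup has finished; all WAL generated between the
    --    two calls must be retained for the snapshot to be restorable.
    SELECT pg_stop_backup();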

Thanks!

Stephen




Re: PostgreSQL suitable?

2017-12-19 Thread Stephen Frost
Greetings Rakesh,

* Rakesh Kumar (rakeshkumar...@mail.com) wrote:
> 
> > There are multiple solutions to doing incremental backups with
> > PostgreSQL, so I'm not sure why you're saying that they don't exist,
> > because that's really not accurate.
> 
> PG's incremental backup is essentially WAL logs applied to a point in time.  
> I am talking about true incremental backup where the db backs up only the 
> changed blocks after the last full backup.  In a DW system where most of the 
> time it is append only, it makes a huge difference in backup time.
> 
> I believe there is one tool which looks at the mtime/atime of each data 
> file and decides whether to back it up or not.  Not sure how robust it is.

Yes, there are multiple such solutions including pgBackRest and barman,
both of which are robust, well supported projects that provide exactly
the incremental backup capability you're asking about.  I can't speak
for barman, but I know that there are many people using pgBackRest
to do incremental backups, some on very large PG databases.

I will point out that supporting incremental backups is tricky to do
correctly, which is why using an existing project that's had years to
work on getting it right is really the best approach.  I wouldn't
recommend trying to roll your own.

Thanks!

Stephen




Re: PostgreSQL suitable?

2017-12-19 Thread Vincenzo Romano
2017-12-19 16:48 GMT+01:00 Stephen Frost:
> * Vincenzo Romano (vincenzo.rom...@notorand.it) wrote:
>> Sorry, my bad: I confused V10 with v11.
>> But according to a discussion with Bruce Momjian, table partitioning in V10 is
>> little more than syntactic sugar around old-fashioned table partitioning.
>
> Well, it's a bit more than that since there's tuple-routing, but you're
> right that the partition elimination is the same as it was in earlier
> versions and based on constraint exclusion.  That said, as noted in the
> email you replied to, reasonable numbers of partitions aren't too bad
> even with the planning cost; it's when you have many thousands of
> partitions that you get into cases where planning time for queries is
> really bad.

When you have to handle a 100TB table, the number of partitions
shouldn't be 10 or 12
as seen in most examples and tests.
This is the same type of issues you hit with partial indexes (this is
why I mentioned them earlier).
Sub-table (and partial index) selection algorithm should be
logarithmic or sub-linear.
As long as it'll be linear, you'll hear about "reasonable number of partitions".
One thousand partitions for a 100TB table would make "manageable"
100GB sub-tables.
I could easily be wrong, but this is an area where PG needs improvement.
One could maybe think about multi-level partitioning, though.

> Also as noted on this thread, PG could handle this data volume, but to
> be efficient there would be work to be done in normalization,
> aggregation, and analyzing the system to ensure you're storing and
> querying on the data efficiently.

Normalization will grow the number of tables (and later the joins), and
you'll very likely end up with at least one table with a
"gazillion" rows.
I fear normalization, provided it's really needed, would provide little help.
With tables that big I usually do "software partitioning".
I make the software aware of the partition schema so it can direct the
queries to the needed (sub-)tables.
So I have the sub-table selection algorithm in my language of
choice, with effective supporting data structures.
Of course, this is far from being the right solution.

-- 
Vincenzo Romano - NotOrAnd.IT
Information Technologies
--
NON QVIETIS MARIBVS NAVTA PERITVS



Re: PostgreSQL suitable?

2017-12-19 Thread Rakesh Kumar

> There are multiple solutions to doing incremental backups with
> PostgreSQL, so I'm not sure why you're saying that they don't exist,
> because that's really not accurate.

PG's incremental backup is essentially WAL logs applied to a point in time.  I 
am talking about true incremental backup where the db backs up only the changed 
blocks after the last full backup.  In a DW system where most of the time it is 
append only, it makes a huge difference in backup time.

I believe there is one tool which looks at the mtime/atime of each data file 
and decides whether to back it up or not.  Not sure how robust it is.



Re: PostgreSQL suitable?

2017-12-19 Thread Andreas Kretschmer



On 19.12.2017 at 15:58, Vincenzo Romano wrote:
But according to a discussion with Bruce Momjian, table partitioning in
V10 is little more than syntactic sugar around old-fashioned table
partitioning.


There is always room for improvement, but table partitioning in pg10 is 
much, much better than before.
(There was an impressive talk about that: 
https://www.postgresql.eu/events/schedule/pgconfeu2017/session/1643-everything-wrong-with-partitioning-in-postgresql-10/ 
)


Regards, Andreas

--
2ndQuadrant - The PostgreSQL Support Company.
www.2ndQuadrant.com




Re: PostgreSQL suitable?

2017-12-19 Thread Tim Clarke
PG10 is happily in production - in active use here :)

Tim Clarke

On 19/12/17 14:50, Vincenzo Romano wrote:
> I've not tested PG10. But it's not released for production yet!
>
> On 19 Dec 2017 15:48, "Andreas Kretschmer" wrote:
>
>
>
> On 19.12.2017 at 15:07, Kellner Thiemo wrote:
>
> Hi
>
> We are developing a data warehouse whose integration
> layer will start with over 100 TB of data. There are not many
> entities, though we can probably partition, and foremost we
> should use inheritance for the lab results. I was just
> wondering if PostgreSQL is able to cope with that. In case it
> depends on the kind of modelling, we have not yet decided
> between classic ERD, anchor modelling and data vault.
>
> Does someone have experience with such a set up?
>
> Kind regards
>
> Thiemo
>
>
>
> It depends at least on the data and the workload. pg10 contains
> better solutions for table partitioning; up to 1000 (maybe more)
> child tables aren't a problem.
> We have customers in that range.
>
>
> Regards, Andreas
>
> -- 
> 2ndQuadrant - The PostgreSQL Support Company.
> www.2ndQuadrant.com 
>
>






Re: PostgreSQL suitable?

2017-12-19 Thread Alvaro Herrera
Vincenzo Romano wrote:
> I've not tested PG10. But it's not released for production yet!

It definitely is, for a couple of months now.  10.1 (the first bugfix
release) has been out for over a month.

-- 
Álvaro Herrera    https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: PostgreSQL suitable?

2017-12-19 Thread Vincenzo Romano
I've not tested PG10. But it's not released for production yet!

On 19 Dec 2017 15:48, "Andreas Kretschmer" wrote:

>
>
> Am 19.12.2017 um 15:07 schrieb Kellner Thiemo:
>
>> Hi
>>
>> We are developing a data warehouse whose integration layer will
>> start with over 100 TB of data. There are not many entities, though we
>> can probably partition, and foremost we should use inheritance for the lab
>> results. I was just wondering if PostgreSQL is able to cope with that. In
>> case it depends on the kind of modelling, we have not yet decided between
>> classic ERD, anchor modelling and data vault.
>>
>> Does someone have experience with such a set up?
>>
>> Kind regards
>>
>> Thiemo
>>
>>
>
> It depends at least on the data and the workload. pg10 contains better
> solutions for table partitioning; up to 1000 (maybe more) child tables
> aren't a problem.
> We have customers in that range.
>
>
> Regards, Andreas
>
> --
> 2ndQuadrant - The PostgreSQL Support Company.
> www.2ndQuadrant.com
>
>
>