Re: [PERFORM] PostgreSQL on ZFS: performance tuning

2016-09-27 Thread Jov
Using ZFS, you can turn full_page_writes off for PG, which can save WAL write
I/O.
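
A minimal postgresql.conf sketch of that suggestion (an illustration only; it
assumes the data directory sits on a ZFS dataset, whose copy-on-write behavior
is what prevents torn pages in the first place):

# postgresql.conf -- only safe when the filesystem guarantees that an 8 kB
# PostgreSQL page can never be partially written during a crash
# (ZFS copy-on-write provides this); otherwise leave it on
full_page_writes = off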

On Jul 29, 2016 at 2:05 PM, "trafdev" wrote:

> Hi.
>
> I have an OLAP-oriented DB (light occasional bulk writes and heavy
> aggregated selects over large periods of data) based on Postgres 9.5.3.
>
> Server is a FreeBSD 10.3 with 64GB of RAM and 2x500GB SSD (root on ZFS,
> mirror).
>
> The largest table is 13GB (with a 4GB index on it), other tables are 4, 2
> and less than 1GB.
>
> After reading a lot of articles and "howto-s" I've collected following set
> of tweaks and hints:
>
>
> ZFS pools creation:
> zfs create zroot/ara/sqldb
> zfs create -o recordsize=8k -o primarycache=all zroot/ara/sqldb/pgsql
>
>
> zfs get primarycache,recordsize,logbias,compression zroot/ara/sqldb/pgsql
> NAME   PROPERTY  VALUE SOURCE
> zroot/ara/sqldb/pgsql  primarycache  all   local
> zroot/ara/sqldb/pgsql  recordsize8Klocal
> zroot/ara/sqldb/pgsql  logbias   latency   local
> zroot/ara/sqldb/pgsql  compression   lz4   inherited from zroot
>
> L2ARC is disabled
> VDEV cache is disabled
>
>
> pgsql -c "mkdir /ara/sqldb/pgsql/data_ix"
> pgsql -c "initdb --locale=en_US.UTF-8 -E UTF-8 -D /ara/sqldb/pgsql/data"
>
>
> /etc/sysctl.conf
> vfs.zfs.metaslab.lba_weighting_enabled=0
>
>
> postgresql.conf:
> listen_addresses = '*'
> max_connections = 100
> shared_buffers = 16GB
> effective_cache_size = 48GB
> work_mem = 500MB
> maintenance_work_mem = 2GB
> min_wal_size = 4GB
> max_wal_size = 8GB
> checkpoint_completion_target = 0.9
> wal_buffers = 16MB
> default_statistics_target = 500
> random_page_cost = 1
> log_lock_waits = on
> log_directory = 'pg_log'
> log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
> log_destination = 'csvlog'
> logging_collector = on
> log_min_duration_statement = 1
> shared_preload_libraries = 'pg_stat_statements'
> track_activity_query_size = 1
> track_io_timing = on
>
>
> zfs-stats -A
> 
> ZFS Subsystem ReportThu Jul 28 21:58:46 2016
> 
> ARC Summary: (HEALTHY)
> Memory Throttle Count:  0
> ARC Misc:
> Deleted:14.92b
> Recycle Misses: 7.01m
> Mutex Misses:   4.72m
> Evict Skips:1.28b
> ARC Size:   53.27%  32.59   GiB
> Target Size: (Adaptive) 53.28%  32.60   GiB
> Min Size (Hard Limit):  12.50%  7.65GiB
> Max Size (High Water):  8:1 61.18   GiB
> ARC Size Breakdown:
> Recently Used Cache Size:   92.83%  30.26   GiB
> Frequently Used Cache Size: 7.17%   2.34GiB
> ARC Hash Breakdown:
> Elements Max:   10.36m
> Elements Current:   78.09%  8.09m
> Collisions: 9.63b
> Chain Max:  26
> Chains: 1.49m
> 
>
> zfs-stats -E
> 
> ZFS Subsystem ReportThu Jul 28 21:59:57 2016
> 
> ARC Efficiency: 49.85b
> Cache Hit Ratio:70.94%  35.36b
> Cache Miss Ratio:   29.06%  14.49b
> Actual Hit Ratio:   66.32%  33.06b
> Data Demand Efficiency: 84.85%  25.39b
> Data Prefetch Efficiency:   17.85%  12.90b
> CACHE HITS BY CACHE LIST:
>   Anonymously Used: 4.10%   1.45b
>   Most Recently Used:   37.82%  13.37b
>   Most Frequently Used: 55.67%  19.68b
>   Most Recently Used Ghost: 0.58%   203.42m
>   Most Frequently Used Ghost:   1.84%   649.83m
> CACHE HITS BY DATA TYPE:
>   Demand Data:  60.92%  21.54b
>   Prefetch Data:6.51%   2.30b
>   Demand Metadata:  32.56%  11.51b
>   Prefetch Metadata:0.00%   358.22k
> CACHE MISSES BY DATA TYPE:
>   Demand Data:  26.55%  3.85b
>   Prefetch Data:73.13%  10.59b
>   Demand Metadata:  0.31%   44.95m
>   Prefetch Metadata:0.00%   350.48k
>
> zfs-stats -Z
> 
> ZFS Subsystem ReportThu Jul 28 22:02:46 2016
> 
> File-Level Prefetch: (HEALTHY)
> DMU Efficiency:

Re: [PERFORM] PostgreSQL on ZFS: performance tuning

2016-09-27 Thread Jov
+1
A larger record size can increase the compression ratio and so reduce the I/O.

Did you set atime=off for ZFS?
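
For reference, a sketch of how both settings could be checked and applied to
the dataset from the original post (recordsize only affects blocks written
after the change, so existing tables would need a rewrite or dump/reload to
benefit):

zfs get atime,recordsize zroot/ara/sqldb/pgsql
zfs set atime=off zroot/ara/sqldb/pgsql          # avoid a metadata update on every read
zfs set recordsize=128k zroot/ara/sqldb/pgsql    # larger records give lz4 more to work with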

On Sep 28, 2016 at 6:16 AM, "Karl Denninger" wrote:

> On 9/27/2016 16:38, Tomas Vondra wrote:
>
> On 09/27/2016 06:00 PM, Torsten Zuehlsdorff wrote:
>
>
>
> On 29.07.2016 08:30, Tomas Vondra wrote:
>
>
>
> On 07/29/2016 08:04 AM, trafdev wrote:
>
> Hi.
>
> I have an OLAP-oriented DB (light occasional bulk writes and heavy
> aggregated selects over large periods of data) based on Postgres 9.5.3.
>
> Server is a FreeBSD 10.3 with 64GB of RAM and 2x500GB SSD (root on ZFS,
> mirror).
>
> The largest table is 13GB (with a 4GB index on it), other tables are 4,
> 2 and less than 1GB.
>
> After reading a lot of articles and "howto-s" I've collected following
> set of tweaks and hints:
>
>
> ZFS pools creation:
> zfs create zroot/ara/sqldb
> zfs create -o recordsize=8k -o primarycache=all zroot/ara/sqldb/pgsql
>
>
> zfs get primarycache,recordsize,logbias,compression
> zroot/ara/sqldb/pgsql
> NAME   PROPERTY  VALUE SOURCE
> zroot/ara/sqldb/pgsql  primarycache  all   local
> zroot/ara/sqldb/pgsql  recordsize8Klocal
> zroot/ara/sqldb/pgsql  logbias   latency   local
> zroot/ara/sqldb/pgsql  compression   lz4   inherited from zroot
>
> L2ARC is disabled
> VDEV cache is disabled
>
>
> pgsql -c "mkdir /ara/sqldb/pgsql/data_ix"
> pgsql -c "initdb --locale=en_US.UTF-8 -E UTF-8 -D /ara/sqldb/pgsql/data"
>
>
> /etc/sysctl.conf
> vfs.zfs.metaslab.lba_weighting_enabled=0
>
>
> postgresql.conf:
> listen_addresses = '*'
> max_connections = 100
> shared_buffers = 16GB
> effective_cache_size = 48GB
>
>
> It may not be a problem for your workload, but this effective_cache_size
> value is far too high.
>
>
> May I ask why? ZFS by default caches up to the size of RAM minus 1 GB.
> Subtracting the shared buffers from the 64 GB of RAM, I would assume 47 GB
> would be a better value, but that would not be far too high. Could you
> please explain this?
>
>
> Because it's not a global value, but an estimate of how much RAM is
> available as a cache for a single query. So if you're running 10 queries at
> the same time, they'll have to share the memory.
>
> It's a bit trickier as there's often a fair amount of cross-backend
> sharing (backends accessing the same data, so it's likely one backend loads
> data into cache, and then other backends access it too).
>
> It also ignores that memory may get allocated for other reasons - some
> queries may allocate quite a bit of memory for sorts/aggregations, so not
> only is
>
>effective_cache_size = RAM - shared_buffers
>
> excessive as it ignores the per-query nature, but also because it neglects
> these other allocations.
>
> regards
>
> You may well find that with lz4 compression a 128kb record size on that
> filesystem is materially faster -- it is here for most workloads under
> Postgres.
>
>
>
> --
> Karl Denninger
> k...@denninger.net
> *The Market Ticker*
> *[S/MIME encrypted email preferred]*
>


Re: [PERFORM] Unexpected expensive index scan

2016-09-27 Thread Jake Nielsen
On Tue, Sep 27, 2016 at 6:24 PM, Jake Nielsen 
wrote:

>
>
> On Tue, Sep 27, 2016 at 6:03 PM, Jake Nielsen 
> wrote:
>
>>
>> On Tue, Sep 27, 2016 at 5:41 PM, Mike Sofen  wrote:
>>
>>> From: Jake Nielsen    Sent: Tuesday, September 27, 2016 5:22 PM
>>>
>>>
>>> the query
>>>
>>> SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN
>>> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND
>>> userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^
>>>
>>>
>>>
>>> On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen 
>>> wrote:
>>>
>>> I've got a query that takes a surprisingly long time to run, and I'm
>>> having a really rough time trying to figure it out.
>>>
>>>
>>>
>>> Before I get started, here are the specifics of the situation:
>>>
>>>
>>>
>>> Here is the table that I'm working with (apologies for spammy indices,
>>> I've been throwing shit at the wall)
>>>
>>> Table "public.syncerevent"
>>>
>>> Column|  Type   |Modifiers
>>>
>>>
>>> --+-+---
>>> ---
>>>
>>>  id   | bigint  | not null default nextval('syncerevent_id_seq'::
>>> regclass)
>>>
>>>  userid   | text|
>>>
>>>  event| text|
>>>
>>>  eventid  | text|
>>>
>>>  originatorid | text|
>>>
>>>  propogatorid | text|
>>>
>>>  kwargs   | text|
>>>
>>>  conflicted   | integer |
>>>
>>> Indexes:
>>>
>>> "syncerevent_pkey" PRIMARY KEY, btree (id)
>>>
>>> "syncereventidindex" UNIQUE, btree (eventid)
>>>
>>> "anothersyncereventidindex" btree (userid)
>>>
>>> "anothersyncereventidindexwithascending" btree (userid, id)
>>>
>>> "asdfasdgasdf" btree (userid, id DESC)
>>>
>>> "syncereventuseridhashindex" hash (userid)
>>>
>>>
>>>
>>> To provide some context, as per the wiki,
>>>
>>> there are 3,290,600 rows in this table.
>>>
>>> It gets added to frequently, but never deleted from.
>>>
>>> The "kwargs" column often contains mid-size JSON strings (roughly 30K
>>> characters on average)
>>>
>>> As of right now, the table has 53 users in it. About 20% of those have a
>>> negligible number of events, but the rest of the users have a fairly even
>>> smattering.
>>>
>>>
>>>
>>> EXPLAIN (ANALYZE, BUFFERS) says:
>>>
>>>
>>>   QUERY PLAN
>>>
>>>
>>> 
>>> 
>>> --
>>>
>>>  Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual
>>> time=3352.390..3403.572 rows=4000 loops=1)  Buffers: shared hit=120244
>>> read=160198
>>>
>>>->  Index Scan using syncerevent_pkey on syncerevent
>>> (cost=0.43..388147.29 rows=1274560 width=615) (actual
>>> time=3352.386..3383.100 rows=4000 loops=1)
>>>
>>>  Index Cond: (id > 12468)
>>>
>>>  Filter: ((propogatorid <> 
>>> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text)
>>> AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))
>>>
>>>  Rows Removed by Filter: 1685801
>>>
>>>  Buffers: shared hit=120244 read=160198
>>>
>>>  Planning time: 0.833 ms
>>>
>>>  Execution time: 3407.633 ms
>>>
>>> (9 rows)
>>>
>>> If it matters/interests you, here is my underlying confusion:
>>>
>>> From some internet sleuthing, I've decided that having a table per user
>>> (which would totally make this problem a non-issue) isn't a great idea.
>>> Because there is a file per table, having a table per user would not scale.
>>> My next thought was partial indexes (which would also totally help), but
>>> since there is also a table per index, this really doesn't side-step the
>>> problem. My rough mental model says: If there exists a way that a
>>> table-per-user scheme would make this more efficient, then there should
>>> also exist an index that could achieve the same effect (or close enough to
>>> not matter). I would think that "userid = '57dc984f1c87461c0967e228'" could
>>> utilize at least one of the two indexes on the userId column, but clearly
>>> I'm not understanding something.
>>>
>>> Any help in making this query more efficient would be greatly
>>> appreciated, and any conceptual insights would be extra awesome.
>>>
>>> Thanks for reading.
>>>
>>> -Jake
>>>
>>> --
>>>
>>>
>>>
>>> This stands out:  WHERE ID > 12468 AND propogatorId NOT IN
>>> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"')
>>>
>>> As does this from the analyze:  Rows Removed by Filter: 1685801
>>>
>>>
>>>
>>> The propogatorid is practically the only column NOT indexed and it’s
>>> used in a “not in”.  It looks like it’s having to do a table scan for all
>>> the rows above the id cutoff to see if any meet the filter requirement.
>>> “not in” can be very expensive.  An index might help on this column.  Have
>>> you tried that?
>>>
>>>
>>>
>>> Your rowcounts aren’t high enough to require partitioning or any other
>>> changes to you

Re: [PERFORM] Unexpected expensive index scan

2016-09-27 Thread Jake Nielsen
On Tue, Sep 27, 2016 at 6:03 PM, Jake Nielsen 
wrote:

>
> On Tue, Sep 27, 2016 at 5:41 PM, Mike Sofen  wrote:
>
>> From: Jake Nielsen    Sent: Tuesday, September 27, 2016 5:22 PM
>>
>>
>> the query
>>
>> SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN
>> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND
>> userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^
>>
>>
>>
>> On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen 
>> wrote:
>>
>> I've got a query that takes a surprisingly long time to run, and I'm
>> having a really rough time trying to figure it out.
>>
>>
>>
>> Before I get started, here are the specifics of the situation:
>>
>>
>>
>> Here is the table that I'm working with (apologies for spammy indices,
>> I've been throwing shit at the wall)
>>
>> Table "public.syncerevent"
>>
>> Column|  Type   |Modifiers
>>
>>
>> --+-+---
>> ---
>>
>>  id   | bigint  | not null default nextval('syncerevent_id_seq'::
>> regclass)
>>
>>  userid   | text|
>>
>>  event| text|
>>
>>  eventid  | text|
>>
>>  originatorid | text|
>>
>>  propogatorid | text|
>>
>>  kwargs   | text|
>>
>>  conflicted   | integer |
>>
>> Indexes:
>>
>> "syncerevent_pkey" PRIMARY KEY, btree (id)
>>
>> "syncereventidindex" UNIQUE, btree (eventid)
>>
>> "anothersyncereventidindex" btree (userid)
>>
>> "anothersyncereventidindexwithascending" btree (userid, id)
>>
>> "asdfasdgasdf" btree (userid, id DESC)
>>
>> "syncereventuseridhashindex" hash (userid)
>>
>>
>>
>> To provide some context, as per the wiki,
>>
>> there are 3,290,600 rows in this table.
>>
>> It gets added to frequently, but never deleted from.
>>
>> The "kwargs" column often contains mid-size JSON strings (roughly 30K
>> characters on average)
>>
>> As of right now, the table has 53 users in it. About 20% of those have a
>> negligible number of events, but the rest of the users have a fairly even
>> smattering.
>>
>>
>>
>> EXPLAIN (ANALYZE, BUFFERS) says:
>>
>>
>>   QUERY PLAN
>>
>>
>> 
>> 
>> --
>>
>>  Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual
>> time=3352.390..3403.572 rows=4000 loops=1)  Buffers: shared hit=120244
>> read=160198
>>
>>->  Index Scan using syncerevent_pkey on syncerevent
>> (cost=0.43..388147.29 rows=1274560 width=615) (actual
>> time=3352.386..3383.100 rows=4000 loops=1)
>>
>>  Index Cond: (id > 12468)
>>
>>  Filter: ((propogatorid <> 
>> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text)
>> AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))
>>
>>  Rows Removed by Filter: 1685801
>>
>>  Buffers: shared hit=120244 read=160198
>>
>>  Planning time: 0.833 ms
>>
>>  Execution time: 3407.633 ms
>>
>> (9 rows)
>>
>> If it matters/interests you, here is my underlying confusion:
>>
>> From some internet sleuthing, I've decided that having a table per user
>> (which would totally make this problem a non-issue) isn't a great idea.
>> Because there is a file per table, having a table per user would not scale.
>> My next thought was partial indexes (which would also totally help), but
>> since there is also a table per index, this really doesn't side-step the
>> problem. My rough mental model says: If there exists a way that a
>> table-per-user scheme would make this more efficient, then there should
>> also exist an index that could achieve the same effect (or close enough to
>> not matter). I would think that "userid = '57dc984f1c87461c0967e228'" could
>> utilize at least one of the two indexes on the userId column, but clearly
>> I'm not understanding something.
>>
>> Any help in making this query more efficient would be greatly
>> appreciated, and any conceptual insights would be extra awesome.
>>
>> Thanks for reading.
>>
>> -Jake
>>
>> --
>>
>>
>>
>> This stands out:  WHERE ID > 12468 AND propogatorId NOT IN
>> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"')
>>
>> As does this from the analyze:  Rows Removed by Filter: 1685801
>>
>>
>>
>> The propogatorid is practically the only column NOT indexed and it’s used
>> in a “not in”.  It looks like it’s having to do a table scan for all the
>> rows above the id cutoff to see if any meet the filter requirement.  “not
>> in” can be very expensive.  An index might help on this column.  Have you
>> tried that?
>>
>>
>>
>> Your rowcounts aren’t high enough to require partitioning or any other
>> changes to your table that I can see right now.
>>
>>
>>
>> Mike Sofen  (Synthetic Genomics)
>>
>>
>>
>
> Thanks Mike, that's true, I hadn't thought of non-indexed columns forcing
> a scan. Unfortunately, just to test this out, I tried pu

Re: [PERFORM] Unexpected expensive index scan

2016-09-27 Thread Jake Nielsen
On Tue, Sep 27, 2016 at 5:41 PM, Mike Sofen  wrote:

> From: Jake Nielsen    Sent: Tuesday, September 27, 2016 5:22 PM
>
>
> the query
>
> SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN
> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND
> userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^
>
>
>
> On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen 
> wrote:
>
> I've got a query that takes a surprisingly long time to run, and I'm
> having a really rough time trying to figure it out.
>
>
>
> Before I get started, here are the specifics of the situation:
>
>
>
> Here is the table that I'm working with (apologies for spammy indices,
> I've been throwing shit at the wall)
>
> Table "public.syncerevent"
>
> Column|  Type   |Modifiers
>
>
> --+-+---
> ---
>
>  id   | bigint  | not null default nextval('syncerevent_id_seq'::
> regclass)
>
>  userid   | text|
>
>  event| text|
>
>  eventid  | text|
>
>  originatorid | text|
>
>  propogatorid | text|
>
>  kwargs   | text|
>
>  conflicted   | integer |
>
> Indexes:
>
> "syncerevent_pkey" PRIMARY KEY, btree (id)
>
> "syncereventidindex" UNIQUE, btree (eventid)
>
> "anothersyncereventidindex" btree (userid)
>
> "anothersyncereventidindexwithascending" btree (userid, id)
>
> "asdfasdgasdf" btree (userid, id DESC)
>
> "syncereventuseridhashindex" hash (userid)
>
>
>
> To provide some context, as per the wiki,
>
> there are 3,290,600 rows in this table.
>
> It gets added to frequently, but never deleted from.
>
> The "kwargs" column often contains mid-size JSON strings (roughly 30K
> characters on average)
>
> As of right now, the table has 53 users in it. About 20% of those have a
> negligible number of events, but the rest of the users have a fairly even
> smattering.
>
>
>
> EXPLAIN (ANALYZE, BUFFERS) says:
>
>
>   QUERY PLAN
>
>
> 
> 
> --
>
>  Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual
> time=3352.390..3403.572 rows=4000 loops=1)  Buffers: shared hit=120244
> read=160198
>
>->  Index Scan using syncerevent_pkey on syncerevent
> (cost=0.43..388147.29 rows=1274560 width=615) (actual
> time=3352.386..3383.100 rows=4000 loops=1)
>
>  Index Cond: (id > 12468)
>
>  Filter: ((propogatorid <> 
> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text)
> AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))
>
>  Rows Removed by Filter: 1685801
>
>  Buffers: shared hit=120244 read=160198
>
>  Planning time: 0.833 ms
>
>  Execution time: 3407.633 ms
>
> (9 rows)
>
> If it matters/interests you, here is my underlying confusion:
>
> From some internet sleuthing, I've decided that having a table per user
> (which would totally make this problem a non-issue) isn't a great idea.
> Because there is a file per table, having a table per user would not scale.
> My next thought was partial indexes (which would also totally help), but
> since there is also a table per index, this really doesn't side-step the
> problem. My rough mental model says: If there exists a way that a
> table-per-user scheme would make this more efficient, then there should
> also exist an index that could achieve the same effect (or close enough to
> not matter). I would think that "userid = '57dc984f1c87461c0967e228'" could
> utilize at least one of the two indexes on the userId column, but clearly
> I'm not understanding something.
>
> Any help in making this query more efficient would be greatly appreciated,
> and any conceptual insights would be extra awesome.
>
> Thanks for reading.
>
> -Jake
>
> --
>
>
>
> This stands out:  WHERE ID > 12468 AND propogatorId NOT IN
> ('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"')
>
> As does this from the analyze:  Rows Removed by Filter: 1685801
>
>
>
> The propogatorid is practically the only column NOT indexed and it’s used
> in a “not in”.  It looks like it’s having to do a table scan for all the
> rows above the id cutoff to see if any meet the filter requirement.  “not
> in” can be very expensive.  An index might help on this column.  Have you
> tried that?
>
>
>
> Your rowcounts aren’t high enough to require partitioning or any other
> changes to your table that I can see right now.
>
>
>
> Mike Sofen  (Synthetic Genomics)
>
>
>

Thanks Mike, that's true, I hadn't thought of non-indexed columns forcing a
scan. Unfortunately, just to test this out, I tried pulling out the more
suspect parts of the query, and it still seems to want to do an index scan:


EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM SyncerEvent WHERE userId =
'57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;


QUERY PLAN


Re: [PERFORM] Unexpected expensive index scan

2016-09-27 Thread Mike Sofen
From: Jake NielsenSent: Tuesday, September 27, 2016 5:22 PM


the query

SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN 
('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND userId = 
'57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^

 

On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen  wrote:

I've got a query that takes a surprisingly long time to run, and I'm having a 
really rough time trying to figure it out.

 

Before I get started, here are the specifics of the situation:

 

Here is the table that I'm working with (apologies for spammy indices, I've 
been throwing shit at the wall)

Table "public.syncerevent"

Column|  Type   |Modifiers  
   

--+-+--

 id   | bigint  | not null default 
nextval('syncerevent_id_seq'::regclass)

 userid   | text| 

 event| text| 

 eventid  | text| 

 originatorid | text| 

 propogatorid | text| 

 kwargs   | text| 

 conflicted   | integer | 

Indexes:

"syncerevent_pkey" PRIMARY KEY, btree (id)

"syncereventidindex" UNIQUE, btree (eventid)

"anothersyncereventidindex" btree (userid)

"anothersyncereventidindexwithascending" btree (userid, id)

"asdfasdgasdf" btree (userid, id DESC)

"syncereventuseridhashindex" hash (userid)

 

To provide some context, as per the wiki, 

there are 3,290,600 rows in this table. 

It gets added to frequently, but never deleted from. 

The "kwargs" column often contains mid-size JSON strings (roughly 30K 
characters on average)

As of right now, the table has 53 users in it. About 20% of those have a 
negligible number of events, but the rest of the users have a fairly even 
smattering.

 

EXPLAIN (ANALYZE, BUFFERS) says:

  QUERY 
PLAN  

--

 Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual 
time=3352.390..3403.572 rows=4000 loops=1)  Buffers: shared hit=120244 
read=160198

   ->  Index Scan using syncerevent_pkey on syncerevent  (cost=0.43..388147.29 
rows=1274560 width=615) (actual time=3352.386..3383.100 rows=4000 loops=1)

 Index Cond: (id > 12468)

 Filter: ((propogatorid <> 
'"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text) AND (conflicted <> 1) AND 
(userid = '57dc984f1c87461c0967e228'::text))

 Rows Removed by Filter: 1685801

 Buffers: shared hit=120244 read=160198

 Planning time: 0.833 ms

 Execution time: 3407.633 ms

(9 rows)

If it matters/interests you, here is my underlying confusion:

From some internet sleuthing, I've decided that having a table per user (which 
would totally make this problem a non-issue) isn't a great idea. Because there 
is a file per table, having a table per user would not scale. My next thought 
was partial indexes (which would also totally help), but since there is also a 
table per index, this really doesn't side-step the problem. My rough mental 
model says: If there exists a way that a table-per-user scheme would make this 
more efficient, then there should also exist an index that could achieve the 
same effect (or close enough to not matter). I would think that "userid = 
'57dc984f1c87461c0967e228'" could utilize at least one of the two indexes on 
the userId column, but clearly I'm not understanding something.

Any help in making this query more efficient would be greatly appreciated, and 
any conceptual insights would be extra awesome.

Thanks for reading.

-Jake

--

 

This stands out:  WHERE ID > 12468 AND propogatorId NOT IN 
('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"')

As does this from the analyze:  Rows Removed by Filter: 1685801

 

The propogatorid is practically the only column NOT indexed and it’s used in a 
“not in”.  It looks like it’s having to do a table scan for all the rows above 
the id cutoff to see if any meet the filter requirement.  “not in” can be very 
expensive.  An index might help on this column.  Have you tried that?

 

Your rowcounts aren’t high enough to require partitioning or any other changes 
to your table that I can see right now.

 

Mike Sofen  (Synthetic Genomics)
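
A minimal sketch of the index suggestion above (the index name is an invented
placeholder; whether the planner would actually use it for this query is a
separate question):

CREATE INDEX syncerevent_propogatorid_idx ON syncerevent (propogatorid);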

 



Re: [PERFORM] Unexpected expensive index scan

2016-09-27 Thread Jake Nielsen
Herp, forgot to include the query:

SELECT * FROM SyncerEvent WHERE ID > 12468 AND propogatorId NOT IN
('"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"') AND conflicted != 1 AND
userId = '57dc984f1c87461c0967e228' ORDER BY ID LIMIT 4000;^

On Tue, Sep 27, 2016 at 5:02 PM, Jake Nielsen 
wrote:

> I've got a query that takes a surprisingly long time to run, and I'm
> having a really rough time trying to figure it out.
>
> Before I get started, here are the specifics of the situation:
>
> Here is the table that I'm working with (apologies for spammy indices,
> I've been throwing shit at the wall)
>
> Table "public.syncerevent"
>
> Column|  Type   |Modifiers
>
>
> --+-+---
> ---
>
>  id   | bigint  | not null default nextval('syncerevent_id_seq'::
> regclass)
>
>  userid   | text|
>
>  event| text|
>
>  eventid  | text|
>
>  originatorid | text|
>
>  propogatorid | text|
>
>  kwargs   | text|
>
>  conflicted   | integer |
>
> Indexes:
>
> "syncerevent_pkey" PRIMARY KEY, btree (id)
>
> "syncereventidindex" UNIQUE, btree (eventid)
>
> "anothersyncereventidindex" btree (userid)
>
> "anothersyncereventidindexwithascending" btree (userid, id)
>
> "asdfasdgasdf" btree (userid, id DESC)
>
> "syncereventuseridhashindex" hash (userid)
>
> To provide some context, as per the wiki,
> there are 3,290,600 rows in this table.
> It gets added to frequently, but never deleted from.
> The "kwargs" column often contains mid-size JSON strings (roughly 30K
> characters on average)
> As of right now, the table has 53 users in it. About 20% of those have a
> negligible number of events, but the rest of the users have a fairly even
> smattering.
>
> EXPLAIN (ANALYZE, BUFFERS) says:
>
>
>   QUERY PLAN
>
>
> 
> 
> --
>
>  Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual
> time=3352.390..3403.572 rows=4000 loops=1)
>
>Buffers: shared hit=120244 read=160198
>
>->  Index Scan using syncerevent_pkey on syncerevent
> (cost=0.43..388147.29 rows=1274560 width=615) (actual
> time=3352.386..3383.100 rows=4000 loops=1)
>
>  Index Cond: (id > 12468)
>
>  Filter: ((propogatorid <> 
> '"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text)
> AND (conflicted <> 1) AND (userid = '57dc984f1c87461c0967e228'::text))
>
>  Rows Removed by Filter: 1685801
>
>  Buffers: shared hit=120244 read=160198
>
>  Planning time: 0.833 ms
>
>  Execution time: 3407.633 ms
>
> (9 rows)
>
>
> The postgres version is: PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled
> by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit
>
>
> This query has gotten slower over time.
>
> The postgres server is running on a db.m3.medium RDS instance on Amazon.
>
> (3.75GB of ram)
>
> (~3 GHz processor, single core)
>
> I ran VACUUM, and ANALYZEd this table just prior to running the EXPLAIN
> command.
>
> Here are the server settings:
>
>  name   | current_setting
>   | source
>
>
>
>
>  application_name   | psql
>  | client
>
>  archive_command| 
> /etc/rds/dbbin/pgscripts/rds_wal_archive
> %p   | configuration file
>
>  archive_mode   | on
>  | configuration file
>
>  archive_timeout| 5min
>  | configuration file
>
>  autovacuum_analyze_scale_factor| 0.05
>| configuration file
>
>  autovacuum_naptime | 30s
>   | configuration file
>
>  autovacuum_vacuum_scale_factor | 0.1
>   | configuration file
>
>  checkpoint_completion_target   | 0.9
>   | configuration file
>
>  client_encoding| UTF8
>  | client
>
>  effective_cache_size   | 1818912kB
>   | configuration file
>
>  fsync  | on
>  | configuration file
>
>  full_page_writes   | on
>  | configuration file
>
>  hot_standby| off
>   | configuration file
>
>  listen_addresses   | *
>   | command line
>
>  lo_compat_privileges   | off
>   | configuration file
>
>  log_checkpoints| on
>  | configuration file
>
>  log_directory  | /rdsdbdata/log/error
>
> Sorry for the formatting, I'm not sure of the best way to format this data
> on a mailing list.
>
>
> If it matters/interests you, here is my underlying confusion:
>
> From some internet sleuthing, I've decided that having a table per use

[PERFORM] Unexpected expensive index scan

2016-09-27 Thread Jake Nielsen
I've got a query that takes a surprisingly long time to run, and I'm having
a really rough time trying to figure it out.

Before I get started, here are the specifics of the situation:

Here is the table that I'm working with (apologies for spammy indices, I've
been throwing shit at the wall)

Table "public.syncerevent"

Column|  Type   |Modifiers


--+-+--

 id   | bigint  | not null default
nextval('syncerevent_id_seq'::regclass)

 userid   | text|

 event| text|

 eventid  | text|

 originatorid | text|

 propogatorid | text|

 kwargs   | text|

 conflicted   | integer |

Indexes:

"syncerevent_pkey" PRIMARY KEY, btree (id)

"syncereventidindex" UNIQUE, btree (eventid)

"anothersyncereventidindex" btree (userid)

"anothersyncereventidindexwithascending" btree (userid, id)

"asdfasdgasdf" btree (userid, id DESC)

"syncereventuseridhashindex" hash (userid)

To provide some context, as per the wiki,
there are 3,290,600 rows in this table.
It gets added to frequently, but never deleted from.
The "kwargs" column often contains mid-size JSON strings (roughly 30K
characters on average)
As of right now, the table has 53 users in it. About 20% of those have a
negligible number of events, but the rest of the users have a fairly even
smattering.

EXPLAIN (ANALYZE, BUFFERS) says:


  QUERY PLAN


--

 Limit  (cost=0.43..1218.57 rows=4000 width=615) (actual
time=3352.390..3403.572 rows=4000 loops=1)

   Buffers: shared hit=120244 read=160198

   ->  Index Scan using syncerevent_pkey on syncerevent
(cost=0.43..388147.29 rows=1274560 width=615) (actual
time=3352.386..3383.100 rows=4000 loops=1)

 Index Cond: (id > 12468)

 Filter: ((propogatorid <>
'"d8130ab9!-66d0!-4f13!-acec!-a9556362f0ad"'::text) AND (conflicted <> 1)
AND (userid = '57dc984f1c87461c0967e228'::text))

 Rows Removed by Filter: 1685801

 Buffers: shared hit=120244 read=160198

 Planning time: 0.833 ms

 Execution time: 3407.633 ms

(9 rows)


The postgres version is: PostgreSQL 9.5.2 on x86_64-pc-linux-gnu, compiled
by gcc (GCC) 4.8.2 20140120 (Red Hat 4.8.2-16), 64-bit


This query has gotten slower over time.

The postgres server is running on a db.m3.medium RDS instance on Amazon.

(3.75GB of ram)

(~3 GHz processor, single core)

I ran VACUUM, and ANALYZEd this table just prior to running the EXPLAIN
command.

Here are the server settings:

 name   | current_setting
| source




 application_name   | psql
 | client

 archive_command|
/etc/rds/dbbin/pgscripts/rds_wal_archive %p   | configuration file

 archive_mode   | on
 | configuration file

 archive_timeout| 5min
 | configuration file

 autovacuum_analyze_scale_factor| 0.05
 | configuration file

 autovacuum_naptime | 30s
| configuration file

 autovacuum_vacuum_scale_factor | 0.1
| configuration file

 checkpoint_completion_target   | 0.9
| configuration file

 client_encoding| UTF8
 | client

 effective_cache_size   | 1818912kB
| configuration file

 fsync  | on
 | configuration file

 full_page_writes   | on
 | configuration file

 hot_standby| off
| configuration file

 listen_addresses   | *
| command line

 lo_compat_privileges   | off
| configuration file

 log_checkpoints| on
 | configuration file

 log_directory  | /rdsdbdata/log/error

Sorry for the formatting, I'm not sure of the best way to format this data
on a mailing list.


If it matters/interests you, here is my underlying confusion:

From some internet sleuthing, I've decided that having a table per user
(which would totally make this problem a non-issue) isn't a great idea.
Because there is a file per table, having a table per user would not scale.
My next thought was partial indexes (which would also totally help), but
since there is also a table per index, this really doesn't side-step the
problem. My rough mental model says: If there exists a way that a
table-per-user scheme would make this more efficient, then there should
also exist an index that could achieve the same effect (or close enough to
not matter). I would think that "userid = '57dc984f1c87461c0967e228'" could
util

Re: [PERFORM] PostgreSQL on ZFS: performance tuning

2016-09-27 Thread Karl Denninger
On 9/27/2016 16:38, Tomas Vondra wrote:
> On 09/27/2016 06:00 PM, Torsten Zuehlsdorff wrote:
>>
>>
>> On 29.07.2016 08:30, Tomas Vondra wrote:
>>>
>>>
>>> On 07/29/2016 08:04 AM, trafdev wrote:
 Hi.

 I have an OLAP-oriented DB (light occasional bulk writes and heavy
 aggregated selects over large periods of data) based on Postgres
 9.5.3.

 Server is a FreeBSD 10.3 with 64GB of RAM and 2x500GB SSD (root on
 ZFS,
 mirror).

 The largest table is 13GB (with a 4GB index on it), other tables
 are 4,
 2 and less than 1GB.

 After reading a lot of articles and "howto-s" I've collected following
 set of tweaks and hints:


 ZFS pools creation:
 zfs create zroot/ara/sqldb
 zfs create -o recordsize=8k -o primarycache=all zroot/ara/sqldb/pgsql


 zfs get primarycache,recordsize,logbias,compression
 zroot/ara/sqldb/pgsql
 NAME   PROPERTY  VALUE SOURCE
 zroot/ara/sqldb/pgsql  primarycache  all   local
 zroot/ara/sqldb/pgsql  recordsize8Klocal
 zroot/ara/sqldb/pgsql  logbias   latency   local
 zroot/ara/sqldb/pgsql  compression   lz4   inherited from
 zroot

 L2ARC is disabled
 VDEV cache is disabled


 pgsql -c "mkdir /ara/sqldb/pgsql/data_ix"
 pgsql -c "initdb --locale=en_US.UTF-8 -E UTF-8 -D
 /ara/sqldb/pgsql/data"


 /etc/sysctl.conf
 vfs.zfs.metaslab.lba_weighting_enabled=0


 postgresql.conf:
 listen_addresses = '*'
 max_connections = 100
 shared_buffers = 16GB
 effective_cache_size = 48GB
>>>
>>> It may not be a problem for your workload, but this
>>> effective_cache_size
>>> value is far too high.
>>
>> May I ask why? ZFS by default caches up to the size of RAM minus 1 GB.
>> Subtracting the shared buffers from the 64 GB of RAM, I would assume 47 GB
>> would be a better value, but that would not be far too high. Could you
>> please explain this?
>
> Because it's not a global value, but an estimate of how much RAM is
> available as a cache for a single query. So if you're running 10
> queries at the same time, they'll have to share the memory.
>
> It's a bit trickier as there's often a fair amount of cross-backend
> sharing (backends accessing the same data, so it's likely one backend
> loads data into cache, and then other backends access it too).
>
> It also ignores that memory may get allocated for other reasons - some
> queries may allocate quite a bit of memory for sorts/aggregations, so
> not only is
>
>effective_cache_size = RAM - shared_buffers
>
> excessive as it ignores the per-query nature, but also because it
> neglects these other allocations.
>
> regards
>
You may well find that with lz4 compression a 128kb record size on that
filesystem is materially faster -- it is here for most workloads under
Postgres.



-- 
Karl Denninger
k...@denninger.net 
/The Market Ticker/
/[S/MIME encrypted email preferred]/




Re: [PERFORM] PostgreSQL on ZFS: performance tuning

2016-09-27 Thread Tomas Vondra

On 09/27/2016 06:00 PM, Torsten Zuehlsdorff wrote:



On 29.07.2016 08:30, Tomas Vondra wrote:



On 07/29/2016 08:04 AM, trafdev wrote:

Hi.

I have an OLAP-oriented DB (light occasional bulk writes and heavy
aggregated selects over large periods of data) based on Postgres 9.5.3.

Server is a FreeBSD 10.3 with 64GB of RAM and 2x500GB SSD (root on ZFS,
mirror).

The largest table is 13GB (with a 4GB index on it), other tables are 4,
2 and less than 1GB.

After reading a lot of articles and "howto-s" I've collected following
set of tweaks and hints:


ZFS pools creation:
zfs create zroot/ara/sqldb
zfs create -o recordsize=8k -o primarycache=all zroot/ara/sqldb/pgsql


zfs get primarycache,recordsize,logbias,compression
zroot/ara/sqldb/pgsql
NAME   PROPERTY  VALUE SOURCE
zroot/ara/sqldb/pgsql  primarycache  all   local
zroot/ara/sqldb/pgsql  recordsize8Klocal
zroot/ara/sqldb/pgsql  logbias   latency   local
zroot/ara/sqldb/pgsql  compression   lz4   inherited from zroot

L2ARC is disabled
VDEV cache is disabled


pgsql -c "mkdir /ara/sqldb/pgsql/data_ix"
pgsql -c "initdb --locale=en_US.UTF-8 -E UTF-8 -D /ara/sqldb/pgsql/data"


/etc/sysctl.conf
vfs.zfs.metaslab.lba_weighting_enabled=0


postgresql.conf:
listen_addresses = '*'
max_connections = 100
shared_buffers = 16GB
effective_cache_size = 48GB


It may not be a problem for your workload, but this effective_cache_size
value is far too high.


May I ask why? ZFS by default caches up to the size of RAM minus 1 GB.
Subtracting the shared buffers from the 64 GB of RAM, I would assume 47 GB
would be a better value, but that would not be far too high. Could you
please explain this?


Because it's not a global value, but an estimate of how much RAM is 
available as a cache for a single query. So if you're running 10 queries 
at the same time, they'll have to share the memory.


It's a bit trickier as there's often a fair amount of cross-backend 
sharing (backends accessing the same data, so it's likely one backend 
loads data into cache, and then other backends access it too).


It also ignores that memory may get allocated for other reasons - some 
queries may allocate quite a bit of memory for sorts/aggregations, so 
not only is


   effective_cache_size = RAM - shared_buffers

excessive as it ignores the per-query nature, but also because it 
neglects these other allocations.


regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
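
To illustrate the per-query point above, a conservative sketch (the
divide-by-expected-concurrency heuristic is an editor's assumption for
illustration, not a rule stated in this thread):

# postgresql.conf -- e.g. ~40 GB of ARC/OS cache expected to be available,
# shared across roughly 4 concurrently running large queries
effective_cache_size = 10GB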




Re: [PERFORM] Millions of tables

2016-09-27 Thread Greg Spiegelberg
On Tue, Sep 27, 2016 at 10:15 AM, Terry Schmitt 
wrote:

>
>
> On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg 
> wrote:
>
>> Hey all,
>>
>> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a
>> time has said not to have millions of tables.  I too have long believed it
>> until recently.
>>
>> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1)
>> for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on
>> those tables.  Table creation initially took 0.018031 secs, average
>> 0.027467 and after tossing out outliers (qty 5) the maximum creation time
>> found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049
>> seconds.  Tables were created by a single process.  Do note that table
>> creation is done via plpgsql function as there are other housekeeping tasks
>> necessary though minimal.
>>
>> No system tuning but here is a list of PostgreSQL knobs and switches:
>> shared_buffers = 2GB
>> work_mem = 48 MB
>> max_stack_depth = 4 MB
>> synchronous_commit = off
>> effective_cache_size = 200 GB
>> pg_xlog is on its own file system
>>
>> There are some still obvious problems.  General DBA functions such as
>> VACUUM and ANALYZE should not be done.  Each will run forever and cause
>> much grief.  Backups are problematic in the traditional pg_dump and PITR
>> space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing
>> it in my test case) are no-no's.  A system or database crash could take
>> potentially hours to days to recover.  There are likely other issues ahead.
>>
>> You may wonder, "why is Greg attempting such a thing?"  I looked at
>> DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
>> it's antiquated and don't get me started on "Hadoop".  I looked at many
>> others and ultimately the recommended use of each vendor was to have one
>> table for all data.  That overcomes the millions of tables problem, right?
>>
>> Problem with the "one big table" solution is I anticipate 1,200 trillion
>> records.  Random access is expected and the customer expects <30ms reads
>> for a single record fetch.
>>
>> No data is loaded... yet  Table and index creation only.  I am interested
>> in the opinions of all including tests I may perform.  If you had this
>> setup, what would you capture / analyze?  I have a job running preparing
>> data.  I did this on a much smaller scale (50k tables) and data load via
>> function allowed close to 6,000 records/second.  The schema has been
>> simplified since and last test reach just over 20,000 records/second with
>> 300k tables.
>>
>> I'm not looking for alternatives yet but input to my test.  Takers?
>>
>> I can't promise immediate feedback but will do my best to respond with
>> results.
>>
>> TIA,
>> -Greg
>>
>
> I have not seen transaction ID wraparound mentioned in this
> thread yet. With the numbers that you are looking at, I could see this as a
> major issue.
>
> T
>

Thank you Terry.  You get the gold star.  :)   I was waiting for that to
come up.

Success means handling this condition.  A whole database vacuum and
dump-restore is out of the question.  Can a properly tuned autovacuum
prevent the situation?

-Greg
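
One illustrative direction for that question (a sketch only; whether autovacuum
workers can actually visit millions of tables before any of them reaches the
freeze limit is exactly the open issue, and the values below are examples, not
tested recommendations):

# postgresql.conf
autovacuum = on
autovacuum_max_workers = 10            # more workers so the huge table list gets covered
autovacuum_naptime = 10s               # start new worker rounds frequently
autovacuum_freeze_max_age = 200000000  # every table must be frozen before reaching this age
vacuum_freeze_min_age = 10000000       # freeze rows aggressively whenever a table is visited
autovacuum_vacuum_cost_delay = 0       # do not throttle; each individual table is small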


Re: [PERFORM] Millions of tables

2016-09-27 Thread Terry Schmitt
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg 
wrote:

> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time
> has said not to have millions of tables.  I too have long believed it until
> recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for
> PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those
> tables.  Table creation initially took 0.018031 secs, average 0.027467 and
> after tossing out outliers (qty 5) the maximum creation time found was
> 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.
> Tables were created by a single process.  Do note that table creation is
> done via plpgsql function as there are other housekeeping tasks necessary
> though minimal.
>
> No system tuning but here is a list of PostgreSQL knobs and switches:
> shared_buffers = 2GB
> work_mem = 48 MB
> max_stack_depth = 4 MB
> synchronous_commit = off
> effective_cache_size = 200 GB
> pg_xlog is on its own file system
>
> There are some still obvious problems.  General DBA functions such as
> VACUUM and ANALYZE should not be done.  Each will run forever and cause
> much grief.  Backups are problematic in the traditional pg_dump and PITR
> space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing
> it in my test case) are no-no's.  A system or database crash could take
> potentially hours to days to recover.  There are likely other issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?"  I looked at
> DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
> it's antiquated and don't get me started on "Hadoop".  I looked at many
> others and ultimately the recommended use of each vendor was to have one
> table for all data.  That overcomes the millions of tables problem, right?
>
> Problem with the "one big table" solution is I anticipate 1,200 trillion
> records.  Random access is expected and the customer expects <30ms reads
> for a single record fetch.
>
> No data is loaded... yet  Table and index creation only.  I am interested
> in the opinions of all including tests I may perform.  If you had this
> setup, what would you capture / analyze?  I have a job running preparing
> data.  I did this on a much smaller scale (50k tables) and data load via
> function allowed close to 6,000 records/second.  The schema has been
> simplified since, and the last test reached just over 20,000 records/second with
> 300k tables.
>
> I'm not looking for alternatives yet but input to my test.  Takers?
>
> I can't promise immediate feedback but will do my best to respond with
> results.
>
> TIA,
> -Greg
>

I have not seen transaction ID wraparound mentioned in this
thread yet. With the numbers that you are looking at, I could see this as a
major issue.

T


Re: [PERFORM] PostgreSQL on ZFS: performance tuning

2016-09-27 Thread Torsten Zuehlsdorff



On 29.07.2016 08:30, Tomas Vondra wrote:



On 07/29/2016 08:04 AM, trafdev wrote:

Hi.

I have an OLAP-oriented DB (light occasional bulk writes and heavy
aggregated selects over large periods of data) based on Postgres 9.5.3.

Server is a FreeBSD 10.3 with 64GB of RAM and 2x500GB SSD (root on ZFS,
mirror).

The largest table is 13GB (with a 4GB index on it), other tables are 4,
2 and less than 1GB.

After reading a lot of articles and "howto-s" I've collected following
set of tweaks and hints:


ZFS pools creation:
zfs create zroot/ara/sqldb
zfs create -o recordsize=8k -o primarycache=all zroot/ara/sqldb/pgsql


zfs get primarycache,recordsize,logbias,compression zroot/ara/sqldb/pgsql
NAME   PROPERTY  VALUE SOURCE
zroot/ara/sqldb/pgsql  primarycache  all   local
zroot/ara/sqldb/pgsql  recordsize8Klocal
zroot/ara/sqldb/pgsql  logbias   latency   local
zroot/ara/sqldb/pgsql  compression   lz4   inherited from zroot

L2ARC is disabled
VDEV cache is disabled


pgsql -c "mkdir /ara/sqldb/pgsql/data_ix"
pgsql -c "initdb --locale=en_US.UTF-8 -E UTF-8 -D /ara/sqldb/pgsql/data"


/etc/sysctl.conf
vfs.zfs.metaslab.lba_weighting_enabled=0


postgresql.conf:
listen_addresses = '*'
max_connections = 100
shared_buffers = 16GB
effective_cache_size = 48GB


It may not be a problem for your workload, but this effective_cache_size
value is far too high.


May I ask why? ZFS by default caches up to the size of RAM minus 1 GB. 
Subtracting the shared buffers from the 64 GB of RAM, I would assume 47 GB 
would be a better value, but that would not be far too high. Could you 
please explain this?


Greetings,
Torsten




Re: [PERFORM] Millions of tables

2016-09-27 Thread Greg Spiegelberg
On Tue, Sep 27, 2016 at 9:42 AM, Mike Sofen  wrote:

> From: Mike Sofen   Sent: Tuesday, September 27, 2016 8:10 AM
>
> From: Greg Spiegelberg   Sent: Monday, September 26, 2016 7:25 AM
> I've gotten more responses than anticipated and have answered some
> questions and gotten some insight but my challenge again is what should I
> capture along the way to prove or disprove this storage pattern?
> Alternatives to the storage pattern aside, I need ideas to test rig,
> capture metrics and suggestions to tune it.
>
>
>
> In the next 24 hours, I will be sending ~1 trillion records to the test
> database.  Because of time to set up, I'd rather have things set up
> properly the first go.
>
>
>
> Thanks!
>
> -Greg
>
> -
>
> Greg, I ran another quick test on a wider table than you’ve described, but
> this time with 80 million rows, with core counts, ram and ssd storage
> similar to what you’d have on that AWS EC2 instance.  This table had 7
> columns (3 integers, 3 text, 1 timestamptz) with an average width of 157
> chars, one btree index on the pk int column.  Using explain analyze, I
> picked one id value out of the 80m and ran a select * where id = x.  It did
> an index scan, had a planning time of 0.077ms, and an execution time of
> 0.254 seconds.  I ran the query for a variety of widely spaced values (so
> the data was uncached) and the timing never changed. This has been
> mirroring my general experience with PG – very fast reads on indexed
> queries.
>
>
>
> Summary:  I think your buckets can be WAY bigger than you are envisioning
> for the simple table design you’ve described.  I’m betting you can easily
> do 500 million rows per bucket before approaching anything close to the
> 30ms max query time.
>
>
>
> Mike Sofen (Synthetic Genomics)
>
>
>
> Totally typo’d the execution time:  it was 0.254 MILLISECONDS, not
> SECONDS.  Thus my comment about going up 10x in bucket size instead of
> appearing to be right at the limit.  Sorry!
>
>
>
I figured.  :)

Haven't ruled it out, but the expectation for this implementation is to
perform at worst 3X slower than memcache or Redis.

Bigger buckets mean a wider spread of response times.  Some buckets
may contain 140k records and some 100X more.

-Greg


Re: [PERFORM] Millions of tables

2016-09-27 Thread Greg Spiegelberg
On Tue, Sep 27, 2016 at 8:30 AM, Craig James  wrote:

> On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg 
> wrote:
>
>> Hey all,
>>
>> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a
>> time has said not to have millions of tables.  I too have long believed it
>> until recently.
>>
>> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1)
>> for PGDATA.  Over the weekend, I created 8M tables with 16M indexes on
>> those tables.  Table creation initially took 0.018031 secs, average
>> 0.027467 and after tossing out outliers (qty 5) the maximum creation time
>> found was 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049
>> seconds.  Tables were created by a single process.  Do note that table
>> creation is done via plpgsql function as there are other housekeeping tasks
>> necessary though minimal.
>>
>> No system tuning but here is a list of PostgreSQL knobs and switches:
>> shared_buffers = 2GB
>> work_mem = 48 MB
>> max_stack_depth = 4 MB
>> synchronous_commit = off
>> effective_cache_size = 200 GB
>> pg_xlog is on its own file system
>>
>> There are some still obvious problems.  General DBA functions such as
>> VACUUM and ANALYZE should not be done.  Each will run forever and cause
>> much grief.  Backups are problematic in the traditional pg_dump and PITR
>> space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing
>> it in my test case) are no-no's.  A system or database crash could take
>> potentially hours to days to recover.  There are likely other issues ahead.
>>
>> You may wonder, "why is Greg attempting such a thing?"  I looked at
>> DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
>> it's antiquated and don't get me started on "Hadoop".  I looked at many
>> others and ultimately the recommended use of each vendor was to have one
>> table for all data.  That overcomes the millions of tables problem, right?
>>
>> Problem with the "one big table" solution is I anticipate 1,200 trillion
>> records.  Random access is expected and the customer expects <30ms reads
>> for a single record fetch.
>>
>
> You don't give enough details to fully explain the problem you're trying
> to solve.
>
>- Will records ever be updated or deleted? If so, what percentage and
>at what frequency?
>- What specifically are you storing (e.g. list of integers, strings,
>people's sex habits, ...)? Or more importantly, are these fixed- or
>variable-sized records?
>- Once the 1,200 trillion records are loaded, is that it? Or do more
>data arrive, and if so, at what rate?
>- Do your queries change, or is there a fixed set of queries?
>- How complex are the joins?
>
> Excellent questions.

1a. Half of the 4M tables will contain ~140k records, and UPDATEs will
occur on roughly 100 records/day/table.  No DELETEs on this first half.
1b. The second half of the 4M tables will contain ~200k records.  Zero
UPDATEs; however, DELETEs will occur on ~100 records/day/table.

2. All 4M tables contain 7 columns: (4) bigints, (2) timestamptz and (1)
boolean.  2M of the tables will have a PKEY on a single bigint column only.
The second 2M tables have a PKEY on (bigint, timestamptz) plus two additional
indexes on different (bigint, timestamptz) column pairs.  (A rough DDL sketch
of both shapes follows after this list.)

3. The trillions-of-records load is just to push the system to find the
maximum record-load capability.  In reality, 200M records/day, or
~2,300/second on average, is the expectation once in production.

4. Queries are fixed and match the indexes laid down on the tables.  Goal
is <30ms/query.  I have attempted queries with and without indexes.
Without indexes the average query response varied between 20ms and 40ms
whereas indexes respond within a much tighter range of 5ms to 9ms.  Both
query performance tests were done during data-ingest.

5. Zero JOINs, and I won't ever let them happen.  However, the 4M tables
INHERIT a data grouping table.  Test rig limits child tables to
1,000/parent.  This was done to explore some other possible access patterns
but they are secondary and if it doesn't work then either a) the
requirement will be dropped or b) I may look at storing the data in the
1,000 child tables directly in the parent table and I'll need to re-run
load & read tests.
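
A rough DDL sketch of the two table shapes described above (all table and
column names are invented placeholders; the INHERITS parent stands in for the
data-grouping table mentioned in point 5):

-- hypothetical data-grouping parent; each parent has <= 1,000 children
CREATE TABLE bucket_0001 ();

-- first-half shape: PK on a single bigint
CREATE TABLE evt_a_000001 (
    id    bigint PRIMARY KEY,
    ref1  bigint,
    ref2  bigint,
    ref3  bigint,
    ts1   timestamptz,
    ts2   timestamptz,
    flag  boolean
) INHERITS (bucket_0001);

-- second-half shape: composite PK plus two more (bigint, timestamptz) indexes
CREATE TABLE evt_b_000001 (
    id    bigint,
    ref1  bigint,
    ref2  bigint,
    ref3  bigint,
    ts1   timestamptz,
    ts2   timestamptz,
    flag  boolean,
    PRIMARY KEY (id, ts1)
) INHERITS (bucket_0001);
CREATE INDEX ON evt_b_000001 (ref1, ts1);
CREATE INDEX ON evt_b_000001 (ref2, ts2);

-- the fixed point-lookup from point 4 would then look something like
-- SELECT * FROM evt_b_000001 WHERE ref1 = $1 AND ts1 >= $2 AND ts1 < $3;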



> The reason I ask these specific questions is because, as others have
> pointed out, this might be a perfect case for a custom (non-relational)
> database. Relational databases are general-purpose tools, sort of like a
> Swiss-Army knife. A Swiss-Army knife does most things passably, but if you
> want to carve wood, or butcher meat, or slice vegetables, you get a knife
> meant for that specific task.
>
>
I'm at a sizing phase.  If 4M tables works, I'll attempt 16M tables.  If it
points to only 2M or 1M, then that's fine.  The 4M-table database is only a
single cog in the storage service design.  I anticipate ~40 of these
databases, but it is dependent upon how many tables work in a single
instance.

The 4M tables are strict r

Re: [PERFORM] Millions of tables

2016-09-27 Thread Mike Sofen
From: Mike Sofen   Sent: Tuesday, September 27, 2016 8:10 AM



From: Greg Spiegelberg   Sent: Monday, September 26, 2016 7:25 AM
I've gotten more responses than anticipated and have answered some questions 
and gotten some insight but my challenge again is what should I capture along 
the way to prove or disprove this storage pattern?  Alternatives to the storage 
pattern aside, I need ideas to test rig, capture metrics and suggestions to 
tune it.

 

In the next 24 hours, I will be sending ~1 trillion records to the test 
database.  Because of time to set up, I'd rather have things set up properly 
the first go.

 

Thanks!

-Greg 

-

Greg, I ran another quick test on a wider table than you’ve described, but this 
time with 80 million rows, with core counts, ram and ssd storage similar to 
what you’d have on that AWS EC2 instance.  This table had 7 columns (3 
integers, 3 text, 1 timestamptz) with an average width of 157 chars, one btree 
index on the pk int column.  Using explain analyze, I picked one id value out 
of the 80m and ran a select * where id = x.  It did an index scan, had a 
planning time of 0.077ms, and an execution time of 0.254 seconds.  I ran the 
query for a variety of widely spaced values (so the data was uncached) and the 
timing never changed. This has been mirroring my general experience with PG – 
very fast reads on indexed queries. 

 

Summary:  I think your buckets can be WAY bigger than you are envisioning for 
the simple table design you’ve described.  I’m betting you can easily do 500 
million rows per bucket before approaching anything close to the 30ms max query 
time.

 

Mike Sofen (Synthetic Genomics)

 

Totally typo’d the execution time:  it was 0.254 MILLISECONDS, not SECONDS.  
Thus my comment about going up 10x in bucket size instead of appearing to be 
right at the limit.  Sorry!

 

Mike



Re: [PERFORM] Millions of tables

2016-09-27 Thread Mike Sofen
 

 

From: Greg Spiegelberg   Sent: Monday, September 26, 2016 7:25 AM
I've gotten more responses than anticipated and have answered some questions 
and gotten some insight but my challenge again is what should I capture along 
the way to prove or disprove this storage pattern?  Alternatives to the storage 
pattern aside, I need ideas to test rig, capture metrics and suggestions to 
tune it.

 

In the next 24 hours, I will be sending ~1 trillion records to the test 
database.  Because of time to set up, I'd rather have things set up properly 
the first go.

 

Thanks!

-Greg 

-

Greg, I ran another quick test on a wider table than you’ve described, but this 
time with 80 million rows, with core counts, ram and ssd storage similar to 
what you’d have on that AWS EC2 instance.  This table had 7 columns (3 
integers, 3 text, 1 timestamptz) with an average width of 157 chars, one btree 
index on the pk int column.  Using explain analyze, I picked one id value out 
of the 80m and ran a select * where id = x.  It did an index scan, had a 
planning time of 0.077ms, and an execution time of 0.254 seconds.  I ran the 
query for a variety of widely spaced values (so the data was uncached) and the 
timing never changed. This has been mirroring my general experience with PG – 
very fast reads on indexed queries. 

 

Summary:  I think your buckets can be WAY bigger than you are envisioning for 
the simple table design you’ve described.  I’m betting you can easily do 500 
million rows per bucket before approaching anything close to the 30ms max query 
time.

 

Mike Sofen (Synthetic Genomics)



Re: [PERFORM] Millions of tables

2016-09-27 Thread Craig James
On Sun, Sep 25, 2016 at 7:50 PM, Greg Spiegelberg 
wrote:

> Hey all,
>
> Obviously everyone who's been in PostgreSQL or almost any RDBMS for a time
> has said not to have millions of tables.  I too have long believed it until
> recently.
>
> AWS d2.8xlarge instance with 9.5 is my test rig using XFS on EBS (io1) for
> PGDATA.  Over the weekend, I created 8M tables with 16M indexes on those
> tables.  Table creation initially took 0.018031 secs, average 0.027467 and
> after tossing out outliers (qty 5) the maximum creation time found was
> 0.66139 seconds.  Total time 30 hours, 31 minutes and 8.435049 seconds.
> Tables were created by a single process.  Do note that table creation is
> done via plpgsql function as there are other housekeeping tasks necessary
> though minimal.
>
> No system tuning but here is a list of PostgreSQL knobs and switches:
> shared_buffers = 2GB
> work_mem = 48 MB
> max_stack_depth = 4 MB
> synchronous_commit = off
> effective_cache_size = 200 GB
> pg_xlog is on its own file system
>
> There are some still obvious problems.  General DBA functions such as
> VACUUM and ANALYZE should not be done.  Each will run forever and cause
> much grief.  Backups are problematic in the traditional pg_dump and PITR
> space.  Large JOIN's by VIEW, SELECT or via table inheritance (I am abusing
> it in my test case) are no-no's.  A system or database crash could take
> potentially hours to days to recover.  There are likely other issues ahead.
>
> You may wonder, "why is Greg attempting such a thing?"  I looked at
> DynamoDB, BigTable, and Cassandra.  I like Greenplum but, let's face it,
> it's antiquated and don't get me started on "Hadoop".  I looked at many
> others and ultimately the recommended use of each vendor was to have one
> table for all data.  That overcomes the millions of tables problem, right?
>
> Problem with the "one big table" solution is I anticipate 1,200 trillion
> records.  Random access is expected and the customer expects <30ms reads
> for a single record fetch.
>

You don't give enough details to fully explain the problem you're trying to
solve.

   - Will records ever be updated or deleted? If so, what percentage and at
   what frequency?
   - What specifically are you storing (e.g. list of integers, strings,
   people's sex habits, ...)? Or more importantly, are these fixed- or
   variable-sized records?
   - Once the 1,200 trillion records are loaded, is that it? Or do more
   data arrive, and if so, at what rate?
   - Do your queries change, or is there a fixed set of queries?
   - How complex are the joins?

The reason I ask these specific questions is because, as others have
pointed out, this might be a perfect case for a custom (non-relational)
database. Relational databases are general-purpose tools, sort of like a
Swiss-Army knife. A Swiss-Army knife does most things passably, but if you
want to carve wood, or butcher meat, or slice vegetables, you get a knife
meant for that specific task.

I've written several custom database-storage systems for very specific
high-performance systems. It's generally a couple weeks of work, and you
have a tailored performance and storage that's hard for a general-purpose
relational system to match.

The difficulty of building such a system depends a lot on the answers to
the questions above.

Craig


> No data is loaded... yet  Table and index creation only.  I am interested
> in the opinions of all including tests I may perform.  If you had this
> setup, what would you capture / analyze?  I have a job running preparing
> data.  I did this on a much smaller scale (50k tables) and data load via
> function allowed close to 6,000 records/second.  The schema has been
> simplified since, and the last test reached just over 20,000 records/second with
> 300k tables.
>
> I'm not looking for alternatives yet but input to my test.  Takers?
>
> I can't promise immediate feedback but will do my best to respond with
> results.
>
> TIA,
> -Greg
>



-- 
-
Craig A. James
Chief Technology Officer
eMolecules, Inc.
-