Re: Zedstore - compressed in-core columnar storage
On 23/05/19 12:07 PM, Ashwin Agrawal wrote:

We (Heikki, me and Melanie) are continuing to build Zedstore. We wish to share the recent additions and modifications. Attaching a patch with the latest code, plus a link to the github branch [1] to follow along. The approach we have been leaning towards is to build the required functionality, get the tests passing, and then continue to iterate and optimize. It's still work-in-progress. Sharing the details now, as we have reached our next milestone for Zedstore: all table AM APIs are implemented (except compute_xid_horizon_for_tuples; it seems a test is needed for that first).

Current state:

- A new type of item, the "array item", has been added to Zedstore to boost compression and performance. It was added based on Konstantin's performance experiments [2] and input from Tomas Vondra [3]. An array item holds multiple datums with consecutive TIDs and the same visibility information. It saves space compared to multiple single items by leaving out the repetitive UNDO and TID fields. An array item cannot mix NULLs and non-NULLs. So, those experiments should show improved performance now. Inserting data via COPY currently creates array items. The code for insert has not been modified since last time; making singleton inserts, and INSERT ... SELECT, performant is still on the todo list.

- We now have a separate, dedicated meta-column btree alongside the rest of the data-column btrees. This special, first btree is used to assign TIDs to tuples and to track the UNDO location which provides visibility information. This btree always exists, which helps to support zero-column tables (which can also result from ADD COLUMN / DROP COLUMN actions). Storing meta-data separately from data helps to get better compression ratios, and it further simplifies the overall design/implementation: deletes only need to edit the meta-column and avoid touching the actual data btrees. Index scans can perform visibility checks based on this meta-column alone and fetch the required datums only for visible tuples. Tuple locks also only need to access this meta-column. Previously, every column btree carried the same undo pointer, so with the old layout a visibility check could potentially be performed using any column; considering the overall simplification the new layout provides, it's fine to give up on that aspect. The dedicated meta-column greatly simplified the handling of ADD COLUMN with default and null values, as this column deterministically provides all the TIDs present in the table, which can't be said for any other data column due to default or null values during ADD COLUMN.

- The Free Page Map is implemented. The FPM keeps track of unused pages in the relation. It is also a b-tree, indexed by physical block number. To be more compact, it stores "extents", i.e. block ranges, rather than single blocks, when possible. An interesting paper [4] on how modern filesystems manage space was a good source of ideas.

- Tuple locks are implemented.

- Serializable isolation is handled.

- With "default_table_access_method=zedstore", 31 out of 194 regress tests and 10 out of 86 isolation tests fail. Many of the currently failing tests fail due to plan differences, e.g. Index scans being selected for zedstore over Index Only scans, as zedstore doesn't yet have a visibility map (I am yet to give index-only scans some thought), or plan diffs due to table size differences between heap and zedstore.

Next few milestones we wish to hit for Zedstore:

- Make check regress green
- Make check isolation green
- Make Zedstore crash safe (which also means replication safe) by implementing WAL logging
- Performance profiling and optimizations for inserts, selects, index scans, etc.
- Once the UNDO framework lands upstream, have Zedstore leverage it instead of its own version of UNDO

Open questions / discussion items:

- How best to get the "column projection list" from the planner? (Currently, we walk the plan and find the columns required for the query in the executor; refer to GetNeededColumnsForNode().)
- How to pass the "column projection list" to the table AM? (As stated in the initial email, we have currently modified the table AM API to pass the projection to the AM.)
- TID is treated as (block, offset) in the current indexing code.
- Physical tlist optimization? (Currently, we disabled it for zedstore.)

Team: Melanie joined Heikki and me to write code for zedstore. The majority of the code continues to be contributed by Heikki. We are continuing to have fun building a column store implementation and iterating aggressively.

References:
1] https://github.com/greenplum-db/postgres/tree/zedstore
2] https://www.postgresql.org/message-id/3978b57e-fe25-ca6b-f56c-48084417e115%40postgrespro.ru
3] https://www.postgresql.org/message-id/20190415173254.nlnk2xqhgt7c5pta%40development
4] https://www.kernel.org/doc/ols/2010/ols2010-pages
Re: Zedstore - compressed in-core columnar storage
On 11/04/19 4:01 PM, Mark Kirkwood wrote:

On 9/04/19 12:27 PM, Ashwin Agrawal wrote: Heikki and I have been hacking recently for a few weeks to implement in-core columnar storage for PostgreSQL. Here's the design and initial implementation of Zedstore, compressed in-core columnar storage (table access method). Attaching the patch and a link to the github branch [1] to follow along.

Very nice. I realize that it is very early days, but applying this patch I've managed to stumble over some compression bugs doing some COPYs:

benchz=# COPY dim1 FROM '/data0/dump/dim1.dat' USING DELIMITERS ',';
psql: ERROR: compression failed. what now?
CONTEXT: COPY dim1, line 458

The log has:

2019-04-11 15:48:43.976 NZST [2006] ERROR: XX000: compression failed. what now?
2019-04-11 15:48:43.976 NZST [2006] CONTEXT: COPY dim1, line 458
2019-04-11 15:48:43.976 NZST [2006] LOCATION: zs_compress_finish, zedstore_compression.c:287
2019-04-11 15:48:43.976 NZST [2006] STATEMENT: COPY dim1 FROM '/data0/dump/dim1.dat' USING DELIMITERS ',';

The dataset is generated from an old DW benchmark I wrote (https://sourceforge.net/projects/benchw/). The rows concerned look like:

457,457th interesting measure,1th measure type,aqwycdevcmybxcnpwqgrdsmfelaxfpbhfxghamfezdiwfvneltvqlivstwralshsppcpchvdkdbraoxnkvexdbpyzgamajfp
458,458th interesting measure,2th measure type,bjgdsciehjvkxvxjqbhtdwtcftpfewxfhfkzjsdrdabbvymlctghsblxucezydghjrgsjjjnmmqhncvpwbwodhnzmtakxhsg

I'll see if changing to LZ4 makes any difference. The COPY works with LZ4 configured.
Re: PostgreSQL pollutes the file system
On 22/03/19 3:05 PM, Tom Lane wrote:
> Michael Paquier writes:
>> I would be curious to hear the reason why such tool names have been
>> chosen from the start. The tools have been switched to C in 9e0ab71
>> from 2003, have been introduced by Peter Eisentraut as of 240e4c9 from
>> 1999, and I cannot spot the thread from the time where this was
>> discussed.
> createuser, at least, dates back to Berkeley days: my copy of the
> PG v4r2 tarball contains a "src/bin/createuser/createuser.sh" file
> dated 1994-03-19. (The 1999 commit you mention just moved the
> functionality around; it was there before.) So I imagine the answer
> is that nobody at the time thought of fitting these scripts into a
> larger ecosystem.

FWIW the whole set is there in version 6.4.2:

markir@vedavec:/download/postgres/src/postgresql-6.4.2/src/bin$ ls -l
total 72
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 cleardbdir
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 createdb
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 createuser
drwxr-sr-x 2 markir adm 4096 Dec 31 1998 CVS
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 destroydb
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 destroyuser
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 initdb
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 initlocation
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 ipcclean
-rw-r--r-- 1 markir adm  795 Dec 19 1998 Makefile
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 pgaccess
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 pg_dump
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 pg_encoding
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 pg_id
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 pg_passwd
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 pgtclsh
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 pg_version
drwxr-sr-x 3 markir adm 4096 Dec 31 1998 psql

-- Mark
Re: Remove Deprecated Exclusive Backup Mode
On 26/02/19 5:41 PM, Stephen Frost wrote:

Greetings Mark,

* Mark Kirkwood (mark.kirkw...@catalyst.net.nz) wrote: ISTM that the onus should be on the patch submitter to provide additions to pg_basebackup that make it as painless as possible for those people *not* using pgBackRest to continue making backups. Breaking this is just not right. Submitting patches that mean that people *must* use pgBackRest is also not right IMHO.

I'm sorry that there's some confusion here. To be clear, no one is required to use pgBackRest. pg_basebackup works quite well and wouldn't be impacted by the changes proposed on this thread. The arguments against removing the exclusive backup feature don't have anything to do with pg_basebackup.

Ah yes (checks the pg_basebackup code), you are correct! Reading this thread I thought I saw a comment to the effect that pg_basebackup was being broken, hence the less-than-impressed post. Your relentless bashing of people doing their own backups and heavy marketing of pgBackRest - unfortunately - made it easy for me to believe that this was a possibility you might see as ok. So - apologies for the misunderstanding; however, less marketing of your own product would avoid me jumping to the wrong conclusion.

regards
Mark
Re: Remove Deprecated Exclusive Backup Mode
On 26/02/19 4:53 PM, Robert Haas wrote:

On Mon, Feb 25, 2019 at 4:38 PM David Steele wrote: FWIW, if you weren't selling backrest quite so hard everywhere backups are mentioned, I'd find this thread a lot more convincing. pgBackRest has not used exclusive backups since the new API was introduced in 9.6, so this is not an issue for our users. Over time we have contributed back to Postgres in areas we thought could be improved based on our work on the pgBackRest project: 6ad8ac60, 9fe3c644, 017e4f25, 78874531, 449338cc, 98267ee8, 8694cc96, 920a5e50, c37b3d08, 5fc1670b, b981df4c. This does not include the various backup-related patches that we have reviewed. If promoting pgBackRest were our primary concern, then it would be in our interest to allow Postgres exclusive backups to stay broken and pg_basebackup to be as primitive as possible.

Hmm, so what you're saying is that you'd like to disable an API that some non-backrest users are relying upon but which no backrest users are relying upon. And you don't understand why some non-backrest users are opposed to that plan. Is that a correct summary of your position?

+1 to Robert's management summary. ISTM that the onus should be on the patch submitter to provide additions to pg_basebackup that make it as painless as possible for those people *not* using pgBackRest to continue making backups. Breaking this is just not right. Submitting patches that mean that people *must* use pgBackRest is also not right IMHO. Finally, Open Source is about working together to make a common project (in this case Postgres) better - not forcing us to use something else you have written (even if it is good).

regards
Mark
Re: Code of Conduct plan
On 15/09/18 08:17, Tom Lane wrote:

Yeah, this. The PG community is mostly nice people, AFAICT. I'll be astonished (and worried) if the CoC committee finds much to do. We're implementing this mostly to make newcomers to the project feel that it's a safe space.

Agreed. However, I think the all-of-life clause gives an open door to potentially less-than-well-intentioned new members joining up to extend a SJW agenda. So in fact the unintended consequence of this may be a *less* safe place for some existing members - unless all of their social media utterances are agreeable to the angry militant left.

It's also worth reminding people that this is v1.0 of the CoC document. We plan to revisit it in a year or so, and thereafter as needed, to improve anything that's causing problems or not working well.

regards, tom lane

+1. At least this means we can address the above if it emerges as a problem.

regards
Mark
Re: Is a modern build system acceptable for older platforms
On 27/04/18 19:10, Yuriy Zhuravlev wrote:

1. You can remove the tools/msvc folder because all your build rules will be universal. (The cmake build now has far fewer lines of code.)
2. You can forget about the terminal on Windows (for Windows guys it's important).
3. You can properly check the environment on Windows; right now we have hardcoded headers and many options. The configure process will be the same on all platforms.
4. You can generate not only GNU Make or MSVC projects; you can also make Xcode projects, Ninja, or NMake for building under MSVC Make. On Windows, you can also easily change MSVC to Clang; it's not hardcoded at all.
5. With CMake you have an easy way to build extra modules (plugins); I already have a working prototype of PGXS for Windows. A plugin should just include a .cmake file generated with the Postgres build. Example: https://github.com/stalkerg/postgres_cmake/blob/cmake/contrib/adminpack/CMakeLists.txt If PGXS is True, it means we are building the module outside postgres. But in my opinion, you should just try CMake to figure out all the benefits.

I note that MySQL (yeah I know, we don't love 'em greatly, but their product niche is similar to ours) and Ceph (OK, it is a distributed storage system, but still a highly popular open source product) have switched to using cmake (relatively) recently. Both these projects were using autoconf-related builds previously and seem to be doing just fine with cmake.

regards
Mark
Re: PostgreSQL's handling of fsync() errors is unsafe and risks data loss at least on XFS
On 19/04/18 00:45, Craig Ringer wrote:

I guarantee you that when you create a 100GB EBS volume on AWS EC2, you don't get 100GB of storage preallocated. AWS are probably pretty good about not running out of backing store, though.

Some db folks (used to, anyway) advise dd'ing over your freshly attached devices on AWS (for performance mainly, IIRC), but that would help prevent some failure scenarios for any thin-provisioned storage (though it would probably really annoy the admins thereof).

regards
Mark
Re: zheap: a new storage format for PostgreSQL
On 03/03/18 05:03, Robert Haas wrote:

On Fri, Mar 2, 2018 at 5:35 AM, Alexander Korotkov wrote: I would propose the "zero-bloat heap" disambiguation of zheap. Seems like a fair enough explanation to me, without the need to rename. :)

It will be possible to bloat a zheap table in certain usage patterns. For example, if you bulk-load the table with a ton of data, commit the transaction, delete every other row, and then never insert any more rows ever again, the table is bloated: it's twice as large as it really needs to be, and we have no provision for shrinking it. In general, I think it's very hard to keep bulk deletes from leaving bloat in the table, and to the extent that it *is* possible, we're not doing it. One could imagine, for example, an index-organized table that automatically combines adjacent pages when they're empty enough, and that also relocates data to physically lower-numbered pages whenever possible. Such a storage engine might automatically shrink the on-disk footprint after a large delete, but we have no plans to go in that direction.

Rather, our assumption is that the bloat most people care about comes from updates. By performing updates in-place as often as possible, we hope to avoid bloating both the heap (because we're not adding new row versions to it which then have to be removed) and the indexes (because if we don't add new row versions at some other TID, then we don't need to add index pointers to that new TID either, or remove the old index pointers to the old TID). Without delete-marking, we can basically optimize the case that is currently handled via HOT updates: no indexed columns have changed. However, the in-place update has a major advantage in that it still works even when the page is completely full, provided that the row does not expand. As Amit's results show, that can hugely reduce bloat and increase performance in the face of long-running concurrent transactions.

With delete-marking, we can also optimize the case where indexed columns have been changed. We don't know exactly how well this will work yet because the code isn't written and therefore can't be benchmarked, but I am hopeful that in-place updates will be a big win here too. So, I would not describe a zheap table as zero-bloat, but it should involve a lot less bloat than our standard heap.

For folk doing ETL-type data warehousing this should be great, as the typical workload tends to be like: COPY (or similar) from a foreign data source, then do several sets of UPDATEs to fix/check/scrub the data... which tends to result in huge bloat with the current heap design (despite telling people 'you can do it another way' to avoid bloat - I guess it seems more intuitive to just do it as described).

regards
Mark
Re: zheap: a new storage format for PostgreSQL
On 02/03/18 16:53, Alvaro Herrera wrote:

I think it was impolite to post this on the very same day the commitfest started. We have enough patches as it is ...

To be fair - he did say things like "wanting feedback..." and "shows an example of using pluggable storage...", and that it is for PG 12. If he had held onto the patches and waited, he'd get criticism of the form "you should have given a heads up earlier...". This is earlier :-)

Best wishes
Mark

P.S.: awesome work.
Re: Changing default value of wal_sync_method to open_datasync on Linux
On 20/02/18 13:27, Tsunakawa, Takayuki wrote:

Hello, I propose changing the default value of wal_sync_method from fdatasync to open_datasync on Linux. The patch is attached. I'm feeling this may be controversial, so I'd like to hear your opinions.

The reason for the change is better performance. Robert Haas said open_datasync was much faster than fdatasync with NVRAM in this thread: https://www.postgresql.org/message-id/flat/c20d38e97bcb33dad59e...@lab.ntt.co.jp#c20d38e97bcb33dad59e...@lab.ntt.co.jp

pg_test_fsync shows higher figures for open_datasync:

[SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
-- 5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync      50829.597 ops/sec  20 usecs/op
fdatasync          42094.381 ops/sec  24 usecs/op
fsync              42209.972 ops/sec  24 usecs/op
fsync_writethrough n/a
open_sync          48669.605 ops/sec  21 usecs/op

[HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
(the figures seem oddly high, though; this may be due to some VM configuration)
-- 5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.
Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
open_datasync      34648.778 ops/sec  29 usecs/op
fdatasync          31570.947 ops/sec  32 usecs/op
fsync              27783.283 ops/sec  36 usecs/op
fsync_writethrough n/a
open_sync          35238.866 ops/sec  28 usecs/op

pgbench only shows marginally better results, and the difference is within the error range. The following is the tps of the default read/write workload of pgbench. I ran the test with all the tables and indexes preloaded with pg_prewarm (except pgbench_history), and with no checkpoint happening. I ran a write workload before running the benchmark so that no new WAL file would be created during the benchmark run.

[SSD on bare metal, ext4 volume mounted with noatime,nobarrier,data=ordered]
               1      2      3      avg
fdatasync      17610  17164  16678  17150
open_datasync  17847  17457  17958  17754 (+3%)

[HDD on VM, ext4 volume mounted with noatime,nobarrier,data=writeback]
(the figures seem oddly high, though; this may be due to some VM configuration)
               1      2      3      avg
fdatasync      4911   5225   5198   5111
open_datasync  4996   5284   5317   5199 (+1%)

As the removed comment describes, when wal_sync_method is open_datasync (or open_sync), open() fails with errno=EINVAL if the ext4 volume is mounted with data=journal. That's because open() specifies O_DIRECT in that case. I don't think that's a problem in practice, because data=journal will not be used where performance matters, and wal_level needs to be changed from its default of replica to minimal and max_wal_senders must be set to 0 for O_DIRECT to be used.

I think the use of 'nobarrier' is probably disabling most/all reliable writing to the devices. What do the numbers look like if you remove this option?

regards
Mark
Re: Partition pruning for Star Schema
On 04/12/17 17:20, Mark Kirkwood wrote:

On 04/12/17 16:08, Ashutosh Bapat wrote: On Sun, Dec 3, 2017 at 5:56 AM, legrand legrand wrote: Hello, I have a typical star schema, having dimension tables "product", "calendar" and "country" and a fact table "sales". This fact table is partitioned by time (range by month) and country (list). Will a query like:

select product.name, calendar.month, sum(sales.net_price)
from sales
inner join product on (product.id = sales.cust_id)
inner join country on (country.id = sales.country_id)
inner join calendar on (calendar.id = sales.calendar_id)
where country.name = 'HERE'
and calendar.year = '2017'
group by product.name, calendar.month

be able to identify the needed partitions?

AFAIU partition pruning, it works only with the partition key columns. So, if country.name and calendar.year were the partition keys, partition pruning would identify the needed partitions from those tables. But the planner doesn't know that calendar.year is somehow related to calendar.id, and so can't transfer that knowledge to identify the partitions of sales.

If you can get your code to perform a star transformation on this type of query, then you might see some partition pruning. Actually it won't - sorry. To get that to work, you would need to evaluate the additional subqueries to produce fixed values! The patch for 'runtime partition pruning' might be what you want, though.

Cheers
Mark