Thank you for your feedback.

On 19.02.2021 6:25, Tomas Vondra wrote:
On 1/22/21 5:04 PM, Konstantin Knizhnik wrote:
...

I have heard from several DBMS experts that the appearance of huge and
cheap non-volatile memory can make a revolution in database system
architecture. If the whole database fits in non-volatile memory, then we
do not need buffers, WAL, ...
But although multi-terabyte NVM announcements were made by IBM several
years ago, I do not know of any successful DBMS prototype with such a new
architecture.

I tried to understand why...

IMHO those predictions are a bit too optimistic, because they often
assume PMEM behavior is mostly similar to DRAM, except for the extra
persistence. But that's not quite true - throughput with PMEM is much
lower.

Actually, that is not completely true. There are several types of NVDIMMs.
The most popular now is NVDIMM-N, which is just a combination of DRAM and flash.
Its speed is the same as that of normal DRAM, but its capacity is also comparable
with DRAM, so I do not think it is a promising approach.
And the speed of Intel Optane memory is definitely much lower than that of DRAM.
But the main advantage of PMEM, from my point of view, is that it allows us
to avoid write-ahead logging altogether!

No, PMEM certainly does not allow avoiding write-ahead logging - we
still need to handle e.g. recovery after a crash, when the data files
are in an unknown / corrupted state.

It is possible to avoid write-ahead logging if we use special algorithms (like PMwCAS)
which ensure atomicity of updates.
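
Just to illustrate the primitive such algorithms build on (this is only a
minimal sketch, not the PMwCAS algorithm itself and not code from the bztree
repository; it assumes PMDK's libpmem, and the pool path, struct and function
names are made up): the new version of a record is written and persisted off
to the side, and then published with a single atomic 8-byte store that is
persisted last, so after a crash we see either the old or the new version,
never a torn one, and no WAL record is needed for the update.

/* build with: cc demo.c -lpmem */
#include <libpmem.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define POOL_PATH "/mnt/pmem/demo.pool"   /* made-up location */
#define POOL_SIZE (8 * 1024 * 1024)

typedef struct
{
    _Atomic uint64_t current;         /* slot number of the valid record */
    char             records[2][64];  /* two versions, updated copy-on-write */
} Root;

static void
demo_update(Root *root, const char *payload)
{
    uint64_t cur = atomic_load(&root->current);
    uint64_t next = 1 - cur;

    /* 1. Write the new version into the unused slot and make it durable. */
    strncpy(root->records[next], payload, sizeof(root->records[next]) - 1);
    pmem_persist(root->records[next], sizeof(root->records[next]));

    /* 2. Publish it with one atomic 8-byte store, persisted last.  A crash
     *    leaves 'current' pointing at either the old or the new slot. */
    atomic_store(&root->current, next);
    pmem_persist(&root->current, sizeof(root->current));
}

int
main(void)
{
    size_t mapped_len;
    int    is_pmem;
    Root  *root = pmem_map_file(POOL_PATH, POOL_SIZE, PMEM_FILE_CREATE,
                                0666, &mapped_len, &is_pmem);

    if (root == NULL)
        return 1;

    demo_update(root, "hello, pmem");
    printf("current record: %s\n",
           root->records[atomic_load(&root->current)]);
    pmem_unmap(root, mapped_len);
    return 0;
}

PMwCAS generalizes this idea to atomically updating several 8-byte words at
once, which is what bztree builds on.
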
The problem with removing the buffer manager and just writing everything
directly to PMEM is the worse latency/throughput (compared to DRAM).
It's probably much more efficient to combine multiple writes in RAM
and then do one (much slower) write to persistent storage than to pay the
higher latency for every write.

It might make sense for data sets that are larger than DRAM but can fit
into PMEM. But that seems like a fairly rare case, and even then it may be
more efficient to redesign the schema to fit into RAM somehow (sharding,
partitioning, ...).

Certainly, avoiding buffering will make sense only if the speed of accessing PMEM is comparable with that of DRAM.
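
A hedged sketch of that buffering trade-off (illustrative only, not a
benchmark; it assumes PMDK's libpmem, and the file path and sizes are made
up): stores to PMEM sit in the CPU cache until they are explicitly flushed,
so flushing after every small record pays the device latency per record,
while doing all the stores first and persisting the range once amortizes it,
which is essentially what a buffer manager does on a much larger scale.

/* build with: cc batch_demo.c -lpmem */
#include <libpmem.h>
#include <string.h>

#define POOL_PATH "/mnt/pmem/batch.demo"
#define NRECORDS  1024
#define RECSIZE   64

int
main(void)
{
    size_t mapped_len;
    int    is_pmem;
    char   record[RECSIZE] = "payload";
    char  *pmem = pmem_map_file(POOL_PATH, NRECORDS * RECSIZE,
                                PMEM_FILE_CREATE, 0666,
                                &mapped_len, &is_pmem);

    if (pmem == NULL)
        return 1;

    /* Variant 1: flush + fence after every record (the "no buffering"
     * approach); each iteration waits for the slower PMEM write. */
    for (int i = 0; i < NRECORDS; i++)
    {
        memcpy(pmem + i * RECSIZE, record, RECSIZE);
        pmem_persist(pmem + i * RECSIZE, RECSIZE);
    }

    /* Variant 2: accumulate all the stores first (they stay in the CPU
     * cache, i.e. effectively in DRAM), then persist the range once. */
    for (int i = 0; i < NRECORDS; i++)
        memcpy(pmem + i * RECSIZE, record, RECSIZE);
    pmem_persist(pmem, NRECORDS * RECSIZE);

    pmem_unmap(pmem, mapped_len);
    return 0;
}
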
So I have to patch the code of this library instead of just using it:

   git@github.com:postgrespro/bztree.git

I have not yet tested the most interesting case: access to PMEM through PMDK.
And I do not have hardware for such tests.
But the first results also seem interesting: PMwCAS is a kind of lockless
algorithm, and it shows much better scaling on a NUMA host compared with
standard Postgres.

I have done a simple parallel insertion test: multiple clients insert data
with random keys.
To make the comparison with vanilla Postgres fairer, I used an unlogged
table:

create unlogged table t(pk int, payload int);
create index on t using bztree(pk);

randinsert.sql:
insert into t (payload,pk) values
(generate_series(1,1000),random()*1000000000);

pgbench -f randinsert.sql -c N -j N -M prepared -n -t 1000 -P 1 postgres

So each client is inserting one million records.
The target system has 160 virtual and 80 real cores with 256GB of RAM.
Results (TPS) are the following:

  N    nbtree    bztree
  1       540       455
 10       993      2237
100      1479      5025

So bztree is more than 3 times faster with 100 clients.
Just for comparison: the result for inserting into this table without an
index is 10k TPS.

I'm not familiar with bztree, but I agree novel indexing structures are
an interesting topic on their own. I only quickly skimmed the bztree
paper, but it seems it might be useful even on DRAM (assuming it will
work with replication etc.).

The other "problem" with placing data files (tables, indexes) on PMEM
and making this code PMEM-aware is that these writes generally happen
asynchronously in the background, so the impact on transaction rate is
fairly low. This is why all the patches in this thread try to apply PMEM
on the WAL logging / flushing, which is on the critical path.

I want to give an update on my prototype.
Unfortunately, my attempt to use bztree with PMEM failed
because of two problems:

1. The libpmemobj/bztree libraries are not compatible with the Postgres architecture. They do support concurrent access, but only by multiple threads within one process (they make heavy use of thread-local variables). The traditional Postgres approach (initialize shared data structures in the postmaster via shared_preload_libraries and let forked child processes inherit them) does not work for libpmemobj.
If a child process does not open the pmem pool itself, any access to it causes a crash.
And if the child does open the pool itself, the pool may be mapped at a different virtual address, while the bztree and pmwcas implementations expect the addresses to be the same in all processes (see the sketch after point 2).

2. There is some bug in the bztree/pmwcas implementation which causes its own test to hang under multithreaded access in persistence mode. I tried to find the cause of the problem but have not succeeded yet (the PMwCAS implementation is very non-trivial).
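
To make the first problem more concrete, here is a minimal sketch of the
per-process mapping issue (it assumes PMDK's libpmemobj; the pool path,
layout name and DemoRoot struct are made up, and this is not code from
bztree). Each backend has to open the pool itself, and the pool may be
mapped at a different virtual address in each process, so only
pool-relative PMEMoids are stable across processes; the raw pointers cached
by the bztree/pmwcas code are not.

/* build with: cc oid_demo.c -lpmemobj */
#include <libpmemobj.h>
#include <stdio.h>

#define POOL_PATH "/mnt/pmem/bztree.pool"
#define LAYOUT    "demo"

typedef struct
{
    long counter;
} DemoRoot;

int
main(void)
{
    /* Every process must do its own open; a mapping inherited over fork()
     * from the postmaster is not usable with libpmemobj. */
    PMEMobjpool *pop = pmemobj_open(POOL_PATH, LAYOUT);

    if (pop == NULL)
        pop = pmemobj_create(POOL_PATH, LAYOUT, PMEMOBJ_MIN_POOL, 0666);
    if (pop == NULL)
        return 1;

    /* A PMEMoid is (pool uuid, offset): the same value is valid in every
     * process.  The pointer returned by pmemobj_direct() is valid only in
     * the process that called it. */
    PMEMoid   root_oid = pmemobj_root(pop, sizeof(DemoRoot));
    DemoRoot *root = pmemobj_direct(root_oid);

    printf("root offset: %llu, mapped here at: %p\n",
           (unsigned long long) root_oid.off, (void *) root);

    pmemobj_close(pop);
    return 0;
}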

So I just compared the single-threaded performance of the bztree test: with
Intel Optane it was about two times worse than with volatile memory.

I still wonder whether using bztree just as an in-memory index would be interesting, because it scales much better than the Postgres B-Tree and even our own PgPro in_memory extension. But certainly a volatile index has very limited uses. Also, full support for all Postgres types in bztree requires a lot of effort
(right now I support only equality comparison).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


