Alex, That might be very good idea. In fact, what you describe closely resembles TempDB in MS SQL Server [1]. It is also offloaded to disk, minimally logged and purged on restart. Ideally we could try designing this component in generic way, so that it could store a lot different temporal stuff: 1) Locks 2) UNDO data 3) Sort/join data (for SELECT and CREATE INDEX, statistics, whatsoever) 4) If needed - visibility info (e.g. for covering indexes and purge/vacuum)
WDYT? Vladimir. [1] https://docs.microsoft.com/en-us/sql/relational-databases/databases/tempdb-database On Fri, Dec 15, 2017 at 4:26 PM, Alexey Goncharuk < alexey.goncha...@gmail.com> wrote: > Vladimir, > > What about moving the entire locking mechanism to a separate off-heap > memory region which will be volatile wrt restarts, but will still support > off-load to disk. In the current architecture, it means that we will need > to allocate a separate DataRegion with no WAL and no crash recovery - locks > are meaningless after a restart, and we will automatically drop them. I > would be interesting to prototype this because I think we may be on-par > with on-heap lock placement, as we already proved for in-memory caches. > > 2017-12-14 21:53 GMT+03:00 Denis Magda <dma...@apache.org>: > > > Vladimir, > > > > No it’s crystal clear, thanks. > > > > If this approach works only for Ignite persistence based deployment, how > > will we handle locking for pure in-memory and caching of 3rd party > > databases scenarios? As I understand the tuples still will be stored in > the > > page memory while there won’t be any opportunity to fallback to disk if > the > > memory usage increases some threshold. > > > > — > > Denis > > > > > On Dec 13, 2017, at 11:21 PM, Vladimir Ozerov <voze...@gridgain.com> > > wrote: > > > > > > Denis, > > > > > > Sorry, may be I was not clear enough - "tuple-approach" and "persistent > > > approach" are the same. By "tuple" I mean a row stored inside a data > > block. > > > Currently we store lock information in Java heap and proposal is to > move > > it > > > to data blocks. The main driver is memory - if there are a rows to be > > > locked we will either run out of memory, or produce serious memory > > > pressure. For example, currently update of 1M entries will consume > ~500Mb > > > of heap. With proposed approach it will consume almost nothing. The > > > drawback is increased number of dirty data pages, but it should not be > a > > > problem because in final implementation we will update data rows before > > > prepare phase anyway, so I do not expect any write amplification in > usual > > > case. > > > > > > This approach is only applicable for Ignite persistence. > > > > > > On Thu, Dec 14, 2017 at 1:53 AM, Denis Magda <dma...@apache.org> > wrote: > > > > > >> Vladimir, > > >> > > >> Thanks for a throughout overview and proposal. > > >> > > >>> Also we could try employing tiered approach > > >>> 1) Try to keep everything in-memory to minimize writes to blocks > > >>> 2) Fallback to persistent lock data if certain threshold is reached. > > >> > > >> What are the benefits of the backed-by-persistence approach in compare > > to > > >> the one based on tuples? Specifically: > > >> - will the persistence approach work for both 3rd party and Ignite > > >> persistence? > > >> - any performance impacts depending on a chosen method? > > >> - what’s faster to implement? > > >> > > >> — > > >> Denis > > >> > > >>> On Dec 13, 2017, at 2:10 AM, Vladimir Ozerov <voze...@gridgain.com> > > >> wrote: > > >>> > > >>> Igniters, > > >>> > > >>> As you probably we know we work actively on MVCC [1] and > transactional > > >> SQL > > >>> [2] features which could be treated as a single huge improvement. We > > >> face a > > >>> number of challenges and one of them is locking. > > >>> > > >>> At the moment information about all locks is kept in memory on > > per-entry > > >>> basis (see GridCacheMvccManager). For every locked key we maintain > > >> current > > >>> lock owner (XID) and the list of would-be-owner transactions. When > > >>> transaction is about to lock an entry two scenarios are possible: > > >>> 1) If entry is not locked we obtain the lock immediately > > >>> 2) if entry is locked we add current transaction to the wait list and > > >> jumps > > >>> to the next entry to be locked. Once the first entry is released by > > >>> conflicting transaction, current transaction becomes an owner of the > > >> first > > >>> entry and tries to promote itself for subsequent entries. > > >>> > > >>> Once all required locks are obtained, response is sent to the caller. > > >>> > > >>> This approach doesn't work well for transactional SQL - if we update > > >>> millions of rows in a single transaction we will simply run out of > > >> memory. > > >>> To mitigate the problem other database vendors keep information about > > >> locks > > >>> inside the tuples. I propose to apply the similar design as follows: > > >>> > > >>> 1) No per-entry lock information is stored in memory anymore. > > >>> 2) The list of active transactions are maintained in memory still > > >>> 3) When TX locks an entry, it sets special marker to the tuple [3] > > >>> 4) When TX meets already locked entry, it enlists itself to wait > queue > > of > > >>> conflicting transaction and suspends > > >>> 5) When first transaction releases conflicting lock, it notifies and > > >> wakes > > >>> up suspended transactions, so they resume locking > > >>> 6) Entry lock data is cleared on transaction commit > > >>> 7) Entry lock data is not cleared on rollback or node restart; > Instead, > > >> we > > >>> will could use active transactions list to identify invalid locks and > > >>> overwrite them as needed. > > >>> > > >>> Also we could try employing tiered approach > > >>> 1) Try to keep everything in-memory to minimize writes to blocks > > >>> 2) Fallback to persistent lock data if certain threshold is reached. > > >>> > > >>> Thoughts? > > >>> > > >>> [1] https://issues.apache.org/jira/browse/IGNITE-3478 > > >>> [2] https://issues.apache.org/jira/browse/IGNITE-4191 > > >>> [3] Depends on final MVCC design - it could be per-tuple XID, undo > > >> vectors, > > >>> per-block transaction lists, etc.. > > >>> > > >>> Vladimir. > > >> > > >> > > > > >