Vlad, All,

On 06/25/2014 05:58 PM, Vlad Khorsun wrote:
>> AFAIK, concurrent GC in pre-committed transactions without a
>> transaction lock was unsafe, and created permanently broken databases.
>> I believe the corruption became especially apparent after the commit
>> intended to fix CORE-3515. I don't know who fixed it, but Red Soft
>> builds have included a patch for this problem for about a year. I was
>> surprised to learn last week that vanilla Firebird still has the
>> problem.
>
>    Do you speak about CORE-3921 ? If yes, it was fixed more than half a
> year ago.
>
> ...
Yes, you are correct. I didn't know about the existence of CORE-3921. By
"vanilla" I meant the latest released version, Firebird 2.5.2. I compared
the patch with the 2.5.2 source code, and that was probably my mistake.

>>>> We tried to be extra careful not to hurt performance while
>>>> establishing these snapshots. We do it much more efficiently than is
>>>> done for isc_tpb_concurrency.
>>>
>>>    AFAIU, you maintain a list of active transactions at the top-level
>>> request. You create this list using a new Lock Manager call which
>>> iterates thru the LCK_tra locks and calls a callback which fills a
>>> list (actually a BePlusTree) of active transactions. Later, the
>>> reader consults this list (i.e. searches the BePlusTree) to check
>>> transaction state.
>>
>> Yes, the data structure used is a bit more involved, but in general
>> you are correct.
>>
>> The BePlusTree-based SparseBitmap implementation is quite quick
>
>    Let me disagree here ;)

Please do not accuse me of a bad implementation without backing that up
with some sort of numbers. BePlusTree and SparseBitmap have both spent
quite some time under the profiler, and they are extremely fast and
scalable data structures, usually much faster than the ad-hoc data
structures used elsewhere.

>> and is currently used in the most performance-sensitive parts of the
>> code (data index retrieval by index, for example).
>
>    Sorry, but I don't understand what "data index retrieval by index"
> is...

You know that it is used to store intermediate RecordBitmaps during
index retrieval operations. This is very performance-sensitive code, and
this implementation outperformed the older SBM-based implementation
while using less memory.

>>>    Is it really faster than copying part of the (already cached) TIP
>>> contents into the transaction and then testing a bit in a plain
>>> vector ? Also, it allows to reuse existing code.
>>
>> No, this would actually have horrible performance characteristics.
>> Take BroadView, for example.
>> They have many millions of transactions processed daily, and a TIP
>> size of 1500-4000 pages is typical. Imagine how costly it is to copy
>> such a thing?
>
>    Do you speak about the active part of the TIP (between OIT and Next)
> or about the whole TIP ? I can easily imagine a TIP of 4000 pages. It
> is not a problem, as we have no need to copy all those pages, only
> starting from the OIT value. But if BroadView permanently has a stuck
> OIT with a difference of a few millions... it is not good for them.
> Hope it is not so.

Well, currently the entire TIP is read/scanned/copied in quite a few
places. AFAIK, there is no optimization to analyze only the
"interesting" portion. Even so, your assumption that applications can
generate only a small number of transactions is not correct. Both Red
Soft and BroadView applications process on the order of 10 million
transactions daily on each database, sometimes more. There can also be
long-running operations (e.g. exports or reports), so an OIT gap of 10^6
represents 1-2 hours of work and is relatively healthy. For example,
right now I can see an OIT gap of 500k on a perfectly healthy,
moderately loaded database instance (client "V" of BroadView). On some
instances, a daily sweep is impossible for performance reasons, and it
is performed weekly, so you can get a gap of 40-60 million transactions
between the oldest transaction and Next. And yes, the database has to be
taken offline for backup-restore once in a while, or you hit the 2
billion transaction mark.

>    I guess that a memmove of 4-8-16KB is much faster than 100-1000
> calls of the callback with a search in the BePlusTree... Also, note,
> you pay the additional cost of the search in the BePlusTree every time
> you test a transaction state, i.e. for every record version accessed by
> an RC transaction.

I think your assumptions about the comparative performance overheads of
the different operations are incorrect. To use the TPC for setting up a
snapshot, one has to ensure it is current and consistent while you set
up the snapshot.
This is a very expensive proposition when compared to a few simple
function calls. In my code I went to great lengths to avoid even a
single page read when setting up the snapshot. Applications tend to
submit a very large number of queries (many thousands per second), and
slow snapshots would kill them. BTW, the AST overhead of obtaining
STATEMENT_ID is the major bottleneck in this situation until you apply
the appropriate patch. I can only assume this change didn't make it into
FB3 because the project assumes the cost of 2 context switches and
multiple syscalls per ID cache fetch to be zero. I don't know. Oh, yes,
and to add insult to injury, despite the costly synchronization this
counter wraps around a 32-bit integer often enough to be non-unique
anyway. :-)

>    To make a brief resume - I can't point to an error in your logic
> (it doesn't mean it has no error),

I share your paranoia and reviewed the code once again. There was a
small inconsistency in the handling of long-running RO/RC transactions.
On one side I was a bit too pessimistic about them and let them hold up
GC for too long; on the other side there was one extremely unlikely case
in which it was still possible for an RC/RO transaction to observe
inconsistent data. Both problems are fixed, and there are no additional
performance implications.

So there is now a clean implementation of statement-level read
consistency; it is in the Red Soft stable tree and used by default. It
will be tested in the field, and we will see how it performs in
practice. But it still inhibits GC for RO/RC transactions in some cases,
which AFAIU is considered unacceptable by the Firebird project.

I am currently studying the problem of GC of intermediate versions. It
seems doable, with logic along the following lines:

1. To protect snapshots from GC, one needs to acquire 2 kinds of locks:
   - a transaction interest lock (one for each transaction active at
     snapshot start, lck_key = ID of the active transaction)
   - a snapshot lock (one for each snapshot, lck_key = ID of the top
     transaction at snapshot start)

   The idea is that:
   - you cannot GC versions for which a transaction interest lock exists
   - at most one record version is needed for each snapshot

2. When we find a version (in VIO_chase_record_version and
   VIO_garbage_collect), we may use the following logic to GC it:
   - if the version satisfies the conditions for purge, use that
     algorithm for GC
   - the version is a candidate for "intermediate GC" if it is not
     current, there is no interest lock on its transaction, and its
     forward version corresponds to the same snapshot as this version
     and is committed

3. The "intermediate GC" algorithm needs to study the entire version
   chain and determine which versions may go and which need to stay. The
   actual GC may use the same idea that is used in update_in_place.
   Imagine you have versions "A", "I", "B". Intermediate version "I"
   needs to go, and "A" needs to stay. First, convert delta version "A"
   to a full version. Second, update the back-pointers to it in "B".
   Third, delete version "I". Fourth, GC blobs and indices as
   appropriate.

This is the general idea; there are many corner cases which need to be
handled as well. The difficult part is implementing this logic so that
it does not hurt performance for regular operations. It seems possible,
as long as we can extend the lock manager interface a little
(LCK_lock_multiple_nowait, LCK_release_multiple, etc). This direction
will take a bit of time to study, implement, test and profile.
Thanks,
Nikolay

Firebird-Devel mailing list, web interface at
https://lists.sourceforge.net/lists/listinfo/firebird-devel