Igor, thanks for your feedback! Let's proceed according to your proposal.

On Mon, Dec 24, 2018 at 14:57 Seliverstov Igor <gvvinbl...@gmail.com> wrote:
>
> Ivan,
>
> We have to design configuration carefully.
> There are possible compatibility issues, and since this API is public, it
> might be difficult to redesign it afterwards.
>
> Since there is a ticket about MVCC-related configuration parts [1] (TxLog
> memory region/size, vacuum thread count and intervals, etc.),
> I suggest solving the issue in the scope of that ticket.
>
> For now, let's introduce a couple of system properties to allow changing
> the behavior during the development phase.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-7966
>
> Regards,
> Igor
>
> On Thu, Dec 20, 2018 at 16:29 Ivan Pavlukhin <vololo...@gmail.com> wrote:
>
> > Hi,
> >
> > I prepared a patch with deadlock detection algorithm implementation
> > (you can find it from ticket [1]).
> >
> > Now I would like to discuss options for configuring deadlock
> > detection. From a usability standpoint, the following needs to be supported:
> > 1. Turn deadlock detection on/off (on by default).
> > 2. Configure a delay between starting to wait for a lock and sending
> > the first deadlock detection message (e.g. 1 second by default).
> > 3. Something else?
> >
> > I can suggest the following options:
> > 1. Use a single long for configuration. A value < 0 means deadlock
> > detection is disabled, 0 means starting detection without a delay, and
> > a value > 0 specifies a delay in milliseconds before starting detection.
> > 2. Use a boolean flag for turning detection on/off and a long value
> > for configuring the delay.
> >
> > Also, there are a number of ways to pass these values to Ignite. I do
> > not have much experience in adding new configuration options to Ignite
> > and I need advice here. For example, it could be done by adding
> > properties to the TransactionConfiguration class. Another way is using
> > Java system properties.
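For illustration, option 2 could be sketched as a couple of TransactionConfiguration-style bean properties; the property names below are my assumption, not an existing Ignite API. A static helper shows how option 1's single-long encoding would map onto the same two settings:

```java
/**
 * Sketch of option 2: a boolean flag plus a delay in milliseconds.
 * Property names are hypothetical, not an existing Ignite API.
 */
public class DeadlockDetectionConfig {
    private boolean deadlockDetectionEnabled = true; // on by default
    private long deadlockDetectionDelayMs = 1000;    // 1 second by default

    public boolean isDeadlockDetectionEnabled() { return deadlockDetectionEnabled; }
    public void setDeadlockDetectionEnabled(boolean enabled) { deadlockDetectionEnabled = enabled; }

    public long getDeadlockDetectionDelayMs() { return deadlockDetectionDelayMs; }
    public void setDeadlockDetectionDelayMs(long delayMs) { deadlockDetectionDelayMs = delayMs; }

    /** Option 1 folds both settings into one long: < 0 disabled, 0 immediate, > 0 delay in ms. */
    public static DeadlockDetectionConfig fromSingleLong(long value) {
        DeadlockDetectionConfig cfg = new DeadlockDetectionConfig();
        cfg.setDeadlockDetectionEnabled(value >= 0);
        cfg.setDeadlockDetectionDelayMs(Math.max(value, 0));
        return cfg;
    }
}
```

Shaped this way, the two proposals are not mutually exclusive: the single-long form stays expressible on top of the flag-plus-delay form.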
> >
> > Igor, Vladimir and others what do you think?
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-9322
> > On Wed, Dec 19, 2018 at 15:51 Ivan Pavlukhin <vololo...@gmail.com> wrote:
> > >
> > > Igor,
> > >
> > > I see your points, and I agree to start with the "lightweight
> > > implementation". Today we have 2PL and there is no activity on
> > > implementing rollback to savepoint. And if we do it in the future, we
> > > will have to return to the subject of deadlock detection anyway.
> > >
> > > I will proceed with "forward-only" approach.
> > > On Tue, Dec 18, 2018 at 18:33 Seliverstov Igor <gvvinbl...@gmail.com> wrote:
> > > >
> > > > Ivan,
> > > >
> > > > I would prefer the forward-only implementation even though it allows
> > > > false-positive check results.
> > > >
> > > > Why I think so:
> > > >
> > > > 1) From my experience, whenever we have a future waiting for a reply,
> > > > we have to take failover into consideration.
> > > > Usually failover implementations are far more complex than the initial
> > > > algorithm itself.
> > > > The forward-only version doesn't need any failover implementation, as
> > > > it expects failed probes (probes that didn't detect a deadlock).
> > > >
> > > > 2) A simple, lightweight implementation is a really good starting
> > > > point: it may be extended if needed, but quite often such an extension
> > > > turns out not to be needed at all.
> > > >
> > > > 3) Any implementation allows false-positive results: we are working
> > > > with a distributed system where a bunch of processes happen at the
> > > > same time. For example, a concurrent rollback on timeout may resolve a
> > > > deadlock, but if a probe has already finished by that time, a rollback
> > > > on deadlock happens as well.
> > > >
> > > > 4) The described case (where a false-positive result is possible) is
> > > > not just unusual but potentially extremely rare. I don't think we
> > > > should solve it, since it doesn't break the grid.
> > > >
> > > > Regards,
> > > > Igor
> > > >
> > > > On Tue, Dec 18, 2018 at 17:57 Ivan Pavlukhin <vololo...@gmail.com> wrote:
> > > >
> > > > > Hi folks,
> > > > >
> > > > > While implementing the edge-chasing deadlock detection algorithm in
> > > > > the scope of [1], it became clear that we basically have 2 options
> > > > > for the "chasing" strategy. I will use the term Near when
> > > > > GridNearTxLocal is assumed and Dht when GridDhtTxLocal (the tx which
> > > > > updates a primary partition) is assumed. So, here are the 2
> > > > > strategies:
> > > > > 1. Send an initial probe when a Dht tx blocks waiting for another tx
> > > > > to release a key. The initial probe is sent from the waiting Dht tx
> > > > > to the Near tx holding the lock. If the receiving Near tx is blocked
> > > > > as well, it relays the probe to the Dht tx it awaits a response
> > > > > from. So, the probe travels like Dht -> Near -> Dht -> Near -> ...
> > > > > Let's call this approach "forward-only".
> > > > > 2. Send probes only between Near transactions. This approach
> > > > > requires an additional request-response call which a Near tx issues
> > > > > to a Dht node to check whether the tx sending the request is waiting
> > > > > for another tx. The call returns the identifier of the Near tx
> > > > > blocking the requesting tx (if it is in fact blocked). Then the
> > > > > waiting Near tx relays a probe to the blocked Near tx. Let's call
> > > > > that approach "lock-checking".
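To make the relay idea concrete, here is a minimal model of the "forward-only" probe; the names and the single wait-for map are my simplification (real edges alternate between Near and Dht transactions on different nodes), shown only to illustrate the two termination conditions:

```java
import java.util.Map;

/** Toy model of a "forward-only" probe relayed along wait-for edges. */
public class ForwardOnlyProbe {
    /**
     * waitsFor.get(a) == b means tx a waits for a lock held by tx b.
     * The probe starts at the tx the initiator waits for and is relayed
     * until it either reaches a non-waiting tx (probe dies, no deadlock)
     * or returns to the initiator (deadlock).
     */
    public static boolean probeFindsDeadlock(Map<Integer, Integer> waitsFor, int initiatorTx) {
        Integer current = waitsFor.get(initiatorTx);
        int hops = 0;
        while (current != null && hops++ <= waitsFor.size()) {
            if (current == initiatorTx)
                return true; // probe came back to its initiator: cycle found
            current = waitsFor.get(current);
        }
        return false;
    }
}
```

In the real algorithm the probe is a network message and each relay step is a remote hop; the map above just collapses that into a local walk.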
> > > > >
> > > > > I would like to define several points to consider while comparing
> > > > > approaches (please point out more if I miss something):
> > > > > 1. Correctness. Especially regarding reporting false deadlocks.
> > > > > 2. Extensibility. Rollback to savepoint and generalization for
> > classic
> > > > > transactions should be considered.
> > > > > 3. Messaging overhead.
> > > > > 4. Simplicity of implementation.
> > > > >
> > > > > You can find an implementation of the "lock-checking" approach in
> > > > > the PR attached to the ticket [1]. Currently it works correctly only
> > > > > for transactions obeying strict two-phase locking (2PL). It is
> > > > > fairly easy to adapt the PR to implement the "forward-only"
> > > > > approach, which will also work correctly under 2PL. The
> > > > > "lock-checking" algorithm uses 1.5 times more messages than
> > > > > "forward-only". The "forward-only" implementation is simpler in the
> > > > > sense that it does not require introducing lock-wait-check messages
> > > > > and a future, but to me that does not look like a big deal. I think
> > > > > the implementation efforts for both approaches are almost equal so
> > > > > far.
> > > > >
> > > > > But the cornerstone here is extensibility. Let's imagine that we
> > > > > are going to implement rollback to savepoint. One suggested way to
> > > > > extend the "forward-only" approach for that case is invalidating
> > > > > sent probes. Basically, a tx initiating a deadlock check assigns a
> > > > > unique id to a probe, and this probe id is invalidated when the tx
> > > > > wakes up from its wait. If that tx receives back a probe with an
> > > > > invalidated id, it simply discards it; if the id is not invalidated,
> > > > > it means a deadlock. But a false deadlock can still be detected. I
> > > > > will provide an example where it does not work correctly.
> > > > >
> > > > > Please note that our transactions can request multiple locks
> > > > > simultaneously (the so-called AND model). Also, rollback to
> > > > > savepoint releases locks obtained after the savepoint was created.
> > > > > Let's use the following notation: "*" means savepoint, "^" means
> > > > > rollback to the previous savepoint, "!" means acquiring a lock, "?"
> > > > > means waiting for a lock, and "&" means requesting multiple locks
> > > > > simultaneously. See the following execution scenario for 3
> > > > > transactions and 3 keys:
> > > > > TA      | TB | TC
> > > > >         |    | *
> > > > >         | 1! | 2!
> > > > > 1? & 3! |    |
> > > > >         | 2? |
> > > > >         |    | ^
> > > > >         |    | 3!
> > > > >
> > > > > Notice that this scenario does not introduce a deadlock, because
> > > > > locks are requested in a consistent order across all transactions.
> > > > > But at different moments the waiting relations TA w TB, TB w TC and
> > > > > TC w TA all exist, so we suspect a false deadlock could be detected
> > > > > here. Also, note that the relation TC w TA is established after TB w
> > > > > TC is destroyed. The messages on a timeline diagram demonstrate how
> > > > > the "forward-only with probe invalidation" approach can detect a
> > > > > false deadlock here [2]. In the picture, L means "lock", U "unlock"
> > > > > and W "wait"; the red cross means reporting a false deadlock. The
> > > > > intuition is that while a probe is in flight, one wait-for edge can
> > > > > be destroyed (TB w TC) and another created (TC w TA). So the
> > > > > approach can report false deadlocks, even when the locking order is
> > > > > consistent!
> > > > >
> > > > > Later I will describe how "lock-checking" can be adapted to support
> > > > > rollback to savepoint. The disclaimer is that it will require even
> > > > > more messages.
> > > > > ----
> > > > >
> > > > > I think that we can already start a discussion.
> > > > > Actually, while false deadlocks can be surprising, we can perhaps
> > > > > tolerate them because our transaction implementation assumes that
> > > > > transactions can be rolled back by the system. The "forward-only"
> > > > > approach requires fewer messages, but in big-O terms both approaches
> > > > > have the same complexity. Unfortunately, both the correctness and
> > > > > the messaging overhead have a hardly predictable impact on real
> > > > > workloads. In an ideal world, thorough testing could give the
> > > > > answers.
> > > > >
> > > > > Please, share your vision.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9322
> > > > > [2]
> > > > >
> > https://gist.githubusercontent.com/pavlukhin/c8c7c6266eeab56048c31f5cdfb31d20/raw/7d1aef9abcd014ec9fdf69840168768ced638b6c/msg_diagram.jpg
> > > > > On Mon, Nov 26, 2018 at 15:27 Ivan Pavlukhin <vololo...@gmail.com> wrote:
> > > > > >
> > > > > > Vladimir,
> > > > > >
> > > > > > I think it might work. So, if nobody minds I can start prototyping
> > > > > > edge-chasing approach.
> > > > > > On Mon, Nov 26, 2018 at 14:32 Vladimir Ozerov <voze...@gridgain.com> wrote:
> > > > > > >
> > > > > > > Ivan,
> > > > > > >
> > > > > > > The problem is that in our system a transaction may wait for N
> > > > > > > locks simultaneously. This may form complex graphs which spread
> > > > > > > across many nodes. Now consider that I have a deadlock between 4
> > > > > > > nodes: A -> B -> *C* -> D -> A. I've sent a message from A and it
> > > > > > > never reached D because C failed. Well, maybe that is good,
> > > > > > > because some transactions will be rolled back. But when they are
> > > > > > > rolled back, other cycles from pending multi-way deadlocks may
> > > > > > > form again, e.g. A -> B -> *E* -> D -> A. So the question is
> > > > > > > whether we will be able to detect them reliably. I think that we
> > > > > > > may use the following assumptions:
> > > > > > > 1) If a data node fails, the relevant transactions will be rolled
> > > > > > > back
> > > > > > > 2) It means that some other transactions will make at least
> > > > > > > partial progress with lock acquisition
> > > > > > >
> > > > > > > So, we can introduce a kind of counter which will be advanced
> > > > > > > on every lock acquired on a given node, and define a rule that a
> > > > > > > deadlock check request for a given pair (NODE_ID, COUNTER) should
> > > > > > > be issued at most once. This way, even if a specific deadlock
> > > > > > > check request is lost due to a node failure and another deadlock
> > > > > > > has formed, some other node will re-trigger the deadlock check
> > > > > > > request sooner or later, as its counter advanced.
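If I read the idea correctly, the at-most-once rule boils down to remembering which (NODE_ID, COUNTER) pairs have already triggered a check; a toy sketch (not actual Ignite code):

```java
import java.util.HashSet;
import java.util.Set;

/** Sketch: a deadlock check for a given (nodeId, lockCounter) pair is issued at most once. */
public class DeadlockCheckTrigger {
    private final Set<String> requested = new HashSet<>();

    /**
     * Returns true if a check request should be sent for this pair (first
     * time it is seen). lockCounter is advanced on every lock acquired on
     * the node, so after a lost request a later conflict observes a new
     * counter value and re-triggers the check.
     */
    public boolean shouldRequestCheck(String nodeId, long lockCounter) {
        return requested.add(nodeId + "#" + lockCounter);
    }
}
```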
> > > > > > >
> > > > > > > Makes sense?
> > > > > > >
> > > > > > > On Sat, Nov 24, 2018 at 5:40 PM Ivan Pavlukhin <vololo...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Vladimir,
> > > > > > > >
> > > > > > > > Regarding fault tolerance. It seems that it is not a problem
> > > > > > > > for edge-chasing approaches. A found deadlock is identified by
> > > > > > > > a message returned to the detection initiator carrying the
> > > > > > > > initiator's identifier. If there is no deadlock, then such a
> > > > > > > > message will not arrive. If some node containing a deadlocked
> > > > > > > > transaction fails, then the failure itself breaks the deadlock.
> > > > > > > > Am I missing something?
> > > > > > > >
> > > > > > > > About messaging overhead. Indeed, it looks like edge-chasing
> > > > > > > > approaches can produce redundant messages. Perhaps we can
> > > > > > > > borrow some optimization ideas from [1]. I am also thinking
> > > > > > > > about a timeout before starting a deadlock detection.
> > > > > > > >
> > > > > > > > Thoughts about adaptive timeouts lead me to thoughts about
> > > > > > > > monitoring. I assume that long waits for locks signal
> > > > > > > > suboptimal performance of the system. It would be great to
> > > > > > > > have means of monitoring transaction contention. That could
> > > > > > > > help users improve their workloads, and it could also help us
> > > > > > > > reason about how contention correlates with deadlock detection
> > > > > > > > overhead.
> > > > > > > >
> > > > > > > > [1] http://mentallandscape.com/Papers_podc84.pdf
> > > > > > > > On Sun, Nov 18, 2018 at 10:52 Vladimir Ozerov <voze...@gridgain.com> wrote:
> > > > > > > > >
> > > > > > > > > Hi Ivan,
> > > > > > > > >
> > > > > > > > > Great analysis. I agree that edge-chasing looks like the
> > > > > > > > > better candidate. First, it will be applicable to both normal
> > > > > > > > > and MVCC transactions. Second, in MVCC we will probably also
> > > > > > > > > need to release some locks when doing rollbacks. What we
> > > > > > > > > should think about is failover: what if a node which was in
> > > > > > > > > the middle of a graph fails? We need to craft fault tolerance
> > > > > > > > > carefully.
> > > > > > > > >
> > > > > > > > > Another critically important point is how to trigger
> > > > > > > > > deadlock detection. My concerns about edge-chasing were not
> > > > > > > > > about latency, but about the number of messages which travel
> > > > > > > > > between nodes. A good algorithm must produce little to no
> > > > > > > > > messages in the case of normal contention while still
> > > > > > > > > providing protection in the case of real deadlocks. So how
> > > > > > > > > would we trigger graph traversal for a given transaction?
> > > > > > > > > Maybe we can start with a hard timeout and then employ a kind
> > > > > > > > > of adaptive increment if a high rate of false positives is
> > > > > > > > > observed. Maybe something else.
> > > > > > > > >
> > > > > > > > > On Sun, Nov 18, 2018 at 10:21 AM Ivan Pavlukhin <vololo...@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > > Vladimir,
> > > > > > > > > >
> > > > > > > > > > Thanks for the articles! I studied them and a couple of
> > > > > > > > > > others, and I would like to share what I found.
> > > > > > > > > >
> > > > > > > > > > BACKGROUND
> > > > > > > > > > First of all, our algorithm implemented in
> > > > > > > > > > org.apache.ignite.internal.processors.cache.transactions.TxDeadlockDetection
> > > > > > > > > > is not an edge-chasing algorithm. In essence, a lock-waiting
> > > > > > > > > > transaction site polls the nodes responsible for the keys of
> > > > > > > > > > interest one by one and reconstructs the global wait-for
> > > > > > > > > > graph (WFG) locally.
> > > > > > > > > > The centralized algorithm discussed in this thread looks
> > > > > > > > > > similar to the algorithms proposed by Ho [1]. The simplest
> > > > > > > > > > of them (two-phased) reports false deadlocks when unlocking
> > > > > > > > > > before transaction finish is permitted, so it seems that it
> > > > > > > > > > only works when strict two-phase locking (2PL) is obeyed.
> > > > > > > > > > Another one (called one-phased) requires tracking all keys
> > > > > > > > > > locked by each transaction, which is not desirable for MVCC
> > > > > > > > > > transactions.
> > > > > > > > > > The aforementioned edge-chasing algorithm by Chandy, Misra
> > > > > > > > > > and Haas [2] is proven to work even when 2PL is not obeyed.
> > > > > > > > > > Also, the performance target is not clear to me. It looks
> > > > > > > > > > like centralized approaches can use fewer messages and
> > > > > > > > > > provide lower latency than distributed ones, but I would
> > > > > > > > > > like to understand the orders of latency and messaging
> > > > > > > > > > overhead. Many of the algorithms are described in [3], and
> > > > > > > > > > some performance details are mentioned there. It is said:
> > > > > > > > > > "As per [GRAY81], most deadlocks involve two to four
> > > > > > > > > > transactions." I see one more argument making deadlock
> > > > > > > > > > detection latency not so important: we can consider a
> > > > > > > > > > deadlock an unavoidable situation, but even a lightning-fast
> > > > > > > > > > detector will not save a database from performance
> > > > > > > > > > degradation in the face of tons of deadlocks.
> > > > > > > > > > What is also interesting, Google Cloud Spanner employs
> > > > > > > > > > simple "wound-wait" deadlock prevention [4] instead of any
> > > > > > > > > > detection algorithm.
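As a side note, the wound-wait rule is simple enough to sketch; this is a toy model of the general technique (Spanner's actual implementation is of course more involved). Transactions are ordered by age, and waits only ever go from younger to older, so wait-for cycles cannot form:

```java
/** Toy model of wound-wait deadlock prevention: older transactions win conflicts. */
public class WoundWait {
    public enum Action { WOUND_HOLDER, WAIT }

    /** A smaller timestamp means an older transaction. */
    public static Action onConflict(long requesterTs, long holderTs) {
        // An older requester aborts ("wounds") the younger lock holder;
        // a younger requester waits for the older holder. Since waits only
        // go from younger to older, a wait-for cycle is impossible.
        return requesterTs < holderTs ? Action.WOUND_HOLDER : Action.WAIT;
    }
}
```

The price of prevention is that some transactions are aborted even when no deadlock would actually have occurred.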
> > > > > > > > > >
> > > > > > > > > > DISCUSSION
> > > > > > > > > > So, I see the following options:
> > > > > > > > > > 1. An edge-chasing algorithm like that of Chandy, Misra and
> > > > > > > > > > Haas [2].
> > > > > > > > > > 2. The centralized two-phased algorithm described by Ho [1],
> > > > > > > > > > with the restriction that transactions must obey 2PL.
> > > > > > > > > > Personally, I lean toward the edge-chasing approach because
> > > > > > > > > > it looks simple and universal. Restricting ourselves to 2PL
> > > > > > > > > > would restrict the features which could be implemented in
> > > > > > > > > > the future (e.g. transaction savepoints), wouldn't it?
> > > > > > > > > >
> > > > > > > > > > WDYT?
> > > > > > > > > >
> > > > > > > > > > Ivan
> > > > > > > > > >
> > > > > > > > > > [1] https://turing.cs.hbg.psu.edu/comp512.papers/HoRamamoorthy-82.pdf
> > > > > > > > > > [2] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > > > > > > [3] https://www.ics.uci.edu/~cs237/reading/p37-elmagarmid.pdf
> > > > > > > > > > [4] https://cloud.google.com/spanner/docs/transactions#rw_transaction_performance
> > > > > > > > > >
> > > > > > > > > > On Wed, Nov 14, 2018 at 22:13 Vladimir Ozerov <voze...@gridgain.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > Ivan,
> > > > > > > > > > >
> > > > > > > > > > > This is an interesting question. I think we should spend
> > > > > > > > > > > some time on formal verification of whether this algorithm
> > > > > > > > > > > works or not. Several articles you may use as a starting
> > > > > > > > > > > point: [1], [2]. From what I understand, Ignite falls into
> > > > > > > > > > > the "AND" model, and the currently implemented algorithm
> > > > > > > > > > > is a variation of the "edge-chasing" approach as per
> > > > > > > > > > > Chandy, Misra and Haas [3], which is *proven to be
> > > > > > > > > > > correct* in that it both detects deadlocks when they are
> > > > > > > > > > > present and does not produce false positives. But it might
> > > > > > > > > > > be too heavy for a system under contention.
> > > > > > > > > > >
> > > > > > > > > > > We need to search for a formal proof of correctness of
> > > > > > > > > > > the proposed algorithm. This area is already researched
> > > > > > > > > > > thoroughly enough, so we should be able to find an answer
> > > > > > > > > > > quickly.
> > > > > > > > > > >
> > > > > > > > > > > Vladimir.
> > > > > > > > > > >
> > > > > > > > > > > [1] http://www.cse.scu.edu/~jholliday/dd_9_16.htm
> > > > > > > > > > > [2] https://www.cs.uic.edu/~ajayk/Chapter10.pdf
> > > > > > > > > > > [3] https://www.cs.utexas.edu/users/misra/scannedPdf.dir/DistrDeadlockDetection.pdf
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Nov 14, 2018 at 6:55 PM Ivan Pavlukhin <vololo...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi,
> > > > > > > > > > > >
> > > > > > > > > > > > Next part as promised. A working item for me is a
> > > > > > > > > > > > deadlock detector for MVCC transactions [1]. The message
> > > > > > > > > > > > is structured in 2 parts. The first is an analysis of
> > > > > > > > > > > > the current state of affairs and possible options to go
> > > > > > > > > > > > with. The second is a proposed option. The first part is
> > > > > > > > > > > > going to be not so short, so some might prefer to skip
> > > > > > > > > > > > it.
> > > > > > > > > > > >
> > > > > > > > > > > > ANALYSIS
> > > > > > > > > > > > The immediate question is "why can't we use an existing
> > > > > > > > > > > > deadlock detector?". The differences between the classic
> > > > > > > > > > > > and MVCC transaction implementations are the answer.
> > > > > > > > > > > > Currently a collection of IgniteTxEntry is used for
> > > > > > > > > > > > detection, but such a collection is not maintained for
> > > > > > > > > > > > MVCC transactions, so it will not work out of the box.
> > > > > > > > > > > > Also, it looks like the current distributed iterative
> > > > > > > > > > > > approach cannot be low-latency in the worst case,
> > > > > > > > > > > > because it may issue many network requests sequentially.
> > > > > > > > > > > > So, what options do we have? Generally we should
> > > > > > > > > > > > choose between centralized and distributed approaches.
> > > > > > > > > > > > By a centralized approach I mean the existence of a
> > > > > > > > > > > > dedicated deadlock detector located on a single node. In
> > > > > > > > > > > > the centralized approach we can face difficulties
> > > > > > > > > > > > related to failover, as the node running the deadlock
> > > > > > > > > > > > detector can fail. In the distributed approach, extra
> > > > > > > > > > > > network messaging overhead can strike, because different
> > > > > > > > > > > > nodes participating in a deadlock can start detection
> > > > > > > > > > > > independently and send redundant messages. I see some
> > > > > > > > > > > > aspects which matter when choosing an implementation.
> > > > > > > > > > > > Here they are, with the (roughly speaking) better
> > > > > > > > > > > > approach in parentheses:
> > > > > > > > > > > > * Detection latency (centralized).
> > > > > > > > > > > > * Messaging overhead (centralized).
> > > > > > > > > > > > * Failover (distributed).
> > > > > > > > > > > > Also, having a park of different deadlock detectors does
> > > > > > > > > > > > not sound very good. I hope that it is possible to
> > > > > > > > > > > > develop a common solution suitable for both kinds of
> > > > > > > > > > > > transactions. I suggest piloting the new solution with
> > > > > > > > > > > > MVCC and then adopting it for classic transactions.
> > > > > > > > > > > >
> > > > > > > > > > > > PROPOSAL
> > > > > > > > > > > > Actually, I propose to start with the centralized
> > > > > > > > > > > > algorithm described by Vladimir at the beginning of the
> > > > > > > > > > > > thread. I will try to outline its main points:
> > > > > > > > > > > > 1. A single deadlock detector exists in the cluster and
> > > > > > > > > > > > maintains the transaction wait-for graph (WFG).
> > > > > > > > > > > > 2. Each cluster node sends wait-for edges to the
> > > > > > > > > > > > detector and invalidates them when they disappear.
> > > > > > > > > > > > 3. The detector periodically searches for cycles in the
> > > > > > > > > > > > WFG and chooses and aborts a victim transaction if a
> > > > > > > > > > > > cycle is found.
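Point 3, the periodic cycle search, is essentially a depth-first search over the WFG. A minimal sketch (the names are mine, and a real detector would return the cycle itself and pick a victim, e.g. the youngest tx, instead of a boolean):

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch of the detector's periodic step: find a cycle in the wait-for graph via DFS. */
public class WfgCycleFinder {
    /** edges.get(a) is the set of txs that tx a waits for. */
    public static boolean hasCycle(Map<Integer, Set<Integer>> edges) {
        Set<Integer> done = new HashSet<>();   // fully explored vertices
        Set<Integer> onPath = new HashSet<>(); // vertices on the current DFS path
        for (Integer tx : edges.keySet())
            if (dfs(tx, edges, done, onPath))
                return true;
        return false;
    }

    private static boolean dfs(Integer tx, Map<Integer, Set<Integer>> edges,
                               Set<Integer> done, Set<Integer> onPath) {
        if (onPath.contains(tx))
            return true;  // back edge: a wait-for cycle
        if (!done.add(tx))
            return false; // already explored from another root
        onPath.add(tx);
        for (Integer next : edges.getOrDefault(tx, Set.of()))
            if (dfs(next, edges, done, onPath))
                return true;
        onPath.remove(tx);
        return false;
    }
}
```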
> > > > > > > > > > > >
> > > > > > > > > > > > Currently I have one fundamental question: is there a
> > > > > > > > > > > > possibility of falsely detected deadlocks due to
> > > > > > > > > > > > concurrent WFG updates?
> > > > > > > > > > > > Of course, there are many points of improvement and
> > > > > > > > > > > > optimization, but I would like to start by discussing
> > > > > > > > > > > > the key points.
> > > > > > > > > > > >
> > > > > > > > > > > > Please share your thoughts!
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9322
> > > > > > > > > > > > On Wed, Nov 14, 2018 at 15:47 ipavlukhin <vololo...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Hi Igniters,
> > > > > > > > > > > > >
> > > > > > > > > > > > > I would like to resume the discussion about a
> > > > > > > > > > > > > deadlock detector. I'll start with the motivation for
> > > > > > > > > > > > > further work on the subject. As I see it, the current
> > > > > > > > > > > > > implementation (entry point
> > > > > > > > > > > > > IgniteTxManager.detectDeadlock) starts detection only
> > > > > > > > > > > > > after a transaction has timed out. To my mind, this is
> > > > > > > > > > > > > not very good from a product usability standpoint. As
> > > > > > > > > > > > > you know, in a deadlock situation some keys become
> > > > > > > > > > > > > unusable for an infinite amount of time. Currently the
> > > > > > > > > > > > > only way to work around it is configuring a timeout,
> > > > > > > > > > > > > but in practice it can be rather tricky to choose a
> > > > > > > > > > > > > proper/universal value for it. So, I see the main
> > > > > > > > > > > > > point as:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Ability to break deadlocks without a need to configure
> > > > > > > > > > > > > timeouts explicitly.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I will return soon with some thoughts about the
> > > > > > > > > > > > > implementation. Meanwhile, does anybody have in mind
> > > > > > > > > > > > > any other usability points which I am missing? Or are
> > > > > > > > > > > > > there any alternative approaches?
> > > > > > > > > > > > >
> > > > > > > > > > > > > On 2017/11/21 08:32:02, Dmitriy Setrakyan <d...@apache.org> wrote:
> > > > > > > > > > > > >  > On Mon, Nov 20, 2017 at 10:15 PM, Vladimir Ozerov <vo...@gridgain.com> wrote:
> > > > > > > > > > > > >  >
> > > > > > > > > > > > >  > > It doesn’t need all txes. Instead, other nodes will send info
> > > > > > > > > > > > >  > > about suspicious txes to it from time to time.
> > > > > > > > > > > > >  >
> > > > > > > > > > > > >  > I see your point, I think it might work.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Ivan Pavlukhin
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best regards,
> > > > > > > > > > Ivan Pavlukhin
> > > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Ivan Pavlukhin
> > > > > > > >
> > > > > >
> > > > > >
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Ivan Pavlukhin
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Ivan Pavlukhin
> > > > >
> > >
> > >
> > >
> > > --
> > > Best regards,
> > > Ivan Pavlukhin
> >
> >
> >
> > --
> > Best regards,
> > Ivan Pavlukhin
> >



-- 
Best regards,
Ivan Pavlukhin
