Hello, Guys. I'm reviewed changes and it looks good to me. There is a simple reproducer for a bug in test framework, see below.
It fails in master and works in branch. I'm planning to merge the fix [1] if Run All will be OK. Please, write to me if you have any objections. [1] https://github.com/apache/ignite/pull/2382 ``` public class MultiJvmSelfTest extends GridCommonAbstractTest { @Override protected boolean isMultiJvm() { return true; } public void testGrid() throws Exception { final IgniteInternalFuture fut = GridTestUtils.runAsync(new RunnableX() { @Override public void runx() throws Exception { try { startGrid(0); startGrid(1); } finally { stopGrid(1); stopGrid(0); } } }); try { fut.get(20_000L); } finally { stopAllGrids(true); } } } ``` В Чт, 15/03/2018 в 15:59 +0000, Dmitry Pavlov пишет: > I see now. Thank you. > > Nikolay, could you please merge this change? > > чт, 15 мар. 2018 г. в 18:48, Vyacheslav Daradur <daradu...@gmail.com>: > > > In brief: > > Nodes in *separate* JVMs are shutting down by the computing task > > *StopGridTask* which has sent from *local* JVM *synchronously* that > > means *local* node must wait for task's finish. > > > > At the same time when a node in *separate* JVM executes the received > > *StopGridTask* which *synchronously* calls *G.stop(igniteInstanceName, > > FALSE)* which is waiting for all computing task's finish, including > > *StopGridTask* which has invoked it. > > > > We have some kind of deadlock: > > *Local* node is waiting for the computing task's finish which is > > waiting for finish of execution *G.stop* which is waiting for all > > computing tasks finish including *StopGridTask*. > > > > We have not noticed that before because we use only stopAllGrids() in > > out tests which stop local JVM without waiting for nodes in other > > JVMs. > > > > > > > > On Thu, Mar 15, 2018 at 6:11 PM, Dmitry Pavlov <dpavlov....@gmail.com> > > wrote: > > > Please address comments in PR. > > > > > > I did not fully understood why sync GridStopMessage message was lost, but > > > async will be successfull. Probably we need discuss it briefly. > > > > > > чт, 1 мар. 2018 г. в 12:11, Vyacheslav Daradur <daradu...@gmail.com>: > > > > > > > > Thank you, Dmitry! > > > > > > > > I'll join this review soon. > > > > > > > > On Thu, Mar 1, 2018 at 12:07 PM, Dmitry Pavlov <dpavlov....@gmail.com> > > > > wrote: > > > > > Hi Vyacheslav, > > > > > > > > > > I will take a look, but first of all I am going to review > > > > > https://reviews.ignite.apache.org/ignite/review/IGNT-CR-502 - it is > > > > > impact > > > > > change in testing framework. Hope you also will join to this review . > > > > > > > > > > Sincerely, > > > > > Dmitiry Pavlov > > > > > > > > > > > > > > > чт, 1 мар. 2018 г. в 11:13, Vyacheslav Daradur <daradu...@gmail.com>: > > > > > > > > > > > > Hi, Dmitry, could you please review it, because you are one of the > > > > > > most experienced people in the testing framework. > > > > > > > > > > > > Please see comment in Jira, because it is in pretty-format there. > > > > > > > > > > > > On Thu, Feb 22, 2018 at 11:56 AM, Vyacheslav Daradur > > > > > > <daradu...@gmail.com> wrote: > > > > > > > Hi Igniters! > > > > > > > > > > > > > > I have investigated the issue [1] and found that stopping node in > > > > > > > separate JVM may stuck thread or leave system process alive after > > > > > > > test > > > > > > > finished. > > > > > > > The main reason is *StopGridTask* that we send from node in local > > > > JVM > > > > > > > to node in separate JVM via remote computing. > > > > > > > We send job synchronously to be sure that node will be stopped, > > > > > > > but > > > > > > > job calls synchronously *G.stop(igniteInstanceName, cancel))* with > > > > > > > *cancel = false*, that means node must wait to compute jobs before > > > > it > > > > > > > goes down what leads to some kind of deadlock. Using of *cancel = > > > > > > > true* would solve the issue but may break some tests’ logic, for > > > > this > > > > > > > reason, I've reworked the method’s synchronization logic [2]. > > > > > > > > > > > > > > We have not noticed that before because we use only > > > > *stopAllGrids()* > > > > > > > in out tests which stop local JVM without waiting for nodes in > > > > other > > > > > > > JVMs. > > > > > > > I believe this fix should reduce the number of flaky tests on > > > > > > > TeamCity, especially which fails because of a cluster from the > > > > > > > previous test has not been stopped properly. > > > > > > > > > > > > > > Ci.tests [3] look a bit better than in master. > > > > > > > Please review prepared PR [2] and share your thoughts. > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910 > > > > > > > [2] https://github.com/apache/ignite/pull/2382 > > > > > > > [3] https://ci.ignite.apache.org/viewLog.html?buildId=1105939 > > > > > > > > > > > > > > > > > > > > > On Fri, Aug 4, 2017 at 11:41 AM, Vyacheslav Daradur > > > > > > > <daradu...@gmail.com> wrote: > > > > > > > > Hi Igniters, > > > > > > > > > > > > > > > > Working on my task I found a bug at call the method > > > > #stopGrid(name), > > > > > > > > it produced ClassCastException. I created a ticket[1]. > > > > > > > > > > > > > > > > After it was fixed[2] I saw that nodes which was started in a > > > > > > > > separate > > > > > > > > JVM > > > > > > > > could stay in process of operation system. > > > > > > > > It was fixed too, but not sure is it fixed in proper way or not. > > > > > > > > > > > > > > > > Could someone review it? > > > > > > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5910 > > > > > > > > [2] https://github.com/apache/ignite/pull/2382 > > > > > > > > > > > > > > > > -- > > > > > > > > Best Regards, Vyacheslav D. > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > Best Regards, Vyacheslav D. > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > Best Regards, Vyacheslav D. > > > > > > > > > > > > > > > > -- > > > > Best Regards, Vyacheslav D. > > > > > > > > -- > > Best Regards, Vyacheslav D. > >
signature.asc
Description: This is a digitally signed message part