[jira] [Created] (IGNITE-8821) Huge logs on BPlusTreeSelfTest put/remove family tests

2018-06-18 Thread Dmitriy Sorokin (JIRA)
Dmitriy Sorokin created IGNITE-8821:
---

 Summary: Huge logs on BPlusTreeSelfTest put/remove family tests
 Key: IGNITE-8821
 URL: https://issues.apache.org/jira/browse/IGNITE-8821
 Project: Ignite
  Issue Type: Test
  Components: general
Reporter: Dmitriy Sorokin
Assignee: Dmitriy Sorokin
 Fix For: 2.6


The printLocks method generates a huge number of ## XX log lines that carry no 
additional information. Suppressing these unnecessary, non-informative lines 
is suggested.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (IGNITE-8769) JVM crash in Basic1 suite in master branch on TC

2018-06-18 Thread Dmitriy Sorokin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16515680#comment-16515680
 ] 

Dmitriy Sorokin commented on IGNITE-8769:
-

[~ivan.glukos], review my patch, please!

> JVM crash in Basic1 suite in master branch on TC
> 
>
> Key: IGNITE-8769
> URL: https://issues.apache.org/jira/browse/IGNITE-8769
> Project: Ignite
>  Issue Type: Bug
>Reporter: Sergey Chugunov
>Assignee: Dmitriy Sorokin
>Priority: Blocker
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.6
>
>
> Latest build with crash: [TC 
> link|https://ci.ignite.apache.org/viewLog.html?buildId=1373991&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_Basic1]
> There is another crash in the history: [TC 
> link|https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_Basic1&branch_IgniteTests24Java8=%3Cdefault%3E&tab=buildTypeStatusDiv]





[jira] [Commented] (IGNITE-8749) Exception for "no space left" situation should be propagated to FailureHandler

2018-06-15 Thread Dmitriy Sorokin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16513761#comment-16513761
 ] 

Dmitriy Sorokin commented on IGNITE-8749:
-

[~agura], review my patch, please!

> Exception for "no space left" situation should be propagated to FailureHandler
> --
>
> Key: IGNITE-8749
> URL: https://issues.apache.org/jira/browse/IGNITE-8749
> Project: Ignite
>  Issue Type: Improvement
>  Components: persistence
>Reporter: Sergey Chugunov
>Assignee: Dmitriy Sorokin
>Priority: Major
> Fix For: 2.6
>
>
> Currently, if a "no space left" situation is detected in the 
> FileWriteAheadLogManager#formatFile method and the corresponding exception is 
> thrown, the exception is not propagated to the FailureHandler and the node 
> continues working.
> As "no space left" is a critical situation, the corresponding exception should be 
> propagated to the handler so that it can take the necessary action.
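
The behavior requested in the description can be illustrated with a minimal, self-contained sketch. It does not use Ignite classes: SimpleFailureHandler, RecordingHandler, WalFormatter and the freeBytes parameter are hypothetical stand-ins invented for illustration, and the real FileWriteAheadLogManager#formatFile may look quite different. The only point shown is that the I/O error is handed to a handler before it is rethrown, so the node cannot silently keep working.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a failure handler; NOT the Ignite interface. */
interface SimpleFailureHandler {
    /** @return true if the node should be stopped. */
    boolean onFailure(Throwable err);
}

/** Records every failure it sees and always asks for the node to stop. */
class RecordingHandler implements SimpleFailureHandler {
    final List<Throwable> seen = new ArrayList<>();

    @Override public boolean onFailure(Throwable err) {
        seen.add(err);
        return true;
    }
}

/** Hypothetical WAL formatter that reports I/O errors instead of swallowing them. */
class WalFormatter {
    private final SimpleFailureHandler hnd;
    private volatile boolean stopped;

    WalFormatter(SimpleFailureHandler hnd) {
        this.hnd = hnd;
    }

    boolean stopped() {
        return stopped;
    }

    /** Formats a segment; freeBytes emulates the space left on disk (an assumption of this sketch). */
    void formatFile(long segmentSize, long freeBytes) throws IOException {
        try {
            if (freeBytes < segmentSize)
                throw new IOException("No space left on device");
            // Real formatting work would go here.
        }
        catch (IOException e) {
            // Notify the handler first, then rethrow so the caller also sees the error.
            if (hnd.onFailure(e))
                stopped = true;

            throw e;
        }
    }
}
```

With this shape, a caller that ignores the rethrown exception still cannot leave the node running, because the handler has already been notified.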





[jira] [Assigned] (IGNITE-8769) JVM crash in Basic1 suite in master branch on TC

2018-06-14 Thread Dmitriy Sorokin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8769?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-8769:
---

Assignee: Dmitriy Sorokin

> JVM crash in Basic1 suite in master branch on TC
> 
>
> Key: IGNITE-8769
> URL: https://issues.apache.org/jira/browse/IGNITE-8769
> Project: Ignite
>  Issue Type: Bug
>Reporter: Sergey Chugunov
>Assignee: Dmitriy Sorokin
>Priority: Blocker
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.6
>
>
> Latest build with crash: [TC 
> link|https://ci.ignite.apache.org/viewLog.html?buildId=1373991&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_Basic1]
> There is another crash in the history: [TC 
> link|https://ci.ignite.apache.org/viewType.html?buildTypeId=IgniteTests24Java8_Basic1&branch_IgniteTests24Java8=%3Cdefault%3E&tab=buildTypeStatusDiv]





[jira] [Assigned] (IGNITE-8749) Exception for "no space left" situation should be propagated to FailureHandler

2018-06-13 Thread Dmitriy Sorokin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-8749:
---

Assignee: Dmitriy Sorokin  (was: Andrey Gura)

> Exception for "no space left" situation should be propagated to FailureHandler
> --
>
> Key: IGNITE-8749
> URL: https://issues.apache.org/jira/browse/IGNITE-8749
> Project: Ignite
>  Issue Type: Improvement
>  Components: persistence
>Reporter: Sergey Chugunov
>Assignee: Dmitriy Sorokin
>Priority: Major
> Fix For: 2.6
>
>
> Currently, if a "no space left" situation is detected in the 
> FileWriteAheadLogManager#formatFile method and the corresponding exception is 
> thrown, the exception is not propagated to the FailureHandler and the node 
> continues working.
> As "no space left" is a critical situation, the corresponding exception should be 
> propagated to the handler so that it can take the necessary action.





[jira] [Assigned] (IGNITE-8742) Direct IO 2 suite is timed out by 'out of disk space' failure emulation test: WAL manager failure does not stoped execution

2018-06-08 Thread Dmitriy Sorokin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-8742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-8742:
---

Assignee: Dmitriy Sorokin

> Direct IO 2 suite is timed out by 'out of disk space' failure emulation test: 
> WAL manager failure does not stoped execution
> ---
>
> Key: IGNITE-8742
> URL: https://issues.apache.org/jira/browse/IGNITE-8742
> Project: Ignite
>  Issue Type: Test
>  Components: persistence
>Reporter: Dmitriy Pavlov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
>
> https://ci.ignite.apache.org/viewLog.html?buildId=1366882&tab=buildResultsDiv&buildTypeId=IgniteTests24Java8_PdsDirectIo2
> Test 
> org.apache.ignite.internal.processors.cache.persistence.IgniteNativeIoWalFlushFsyncSelfTest#testFailAfterStart
> emulates a disk space problem by throwing an exception.
> In the direct IO environment, real disk IO is performed; tmpfs is not used.
> Sometimes this error comes from rollover() of a segment, and the failure handler 
> reacts accordingly.
> {noformat}
> detected. Will be handled accordingly to configured handler [hnd=class 
> o.a.i.failure.StopNodeFailureHandler, failureCtx=FailureContext 
> [type=CRITICAL_ERROR, err=class o.a.i.i.pagemem.wal.StorageException: Unable 
> to write]]
> class org.apache.ignite.internal.pagemem.wal.StorageException: Unable to write
>   at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileWriteHandle.writeBuffer(FsyncModeFileWriteAheadLogManager.java:2964)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileWriteHandle.flush(FsyncModeFileWriteAheadLogManager.java:2640)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileWriteHandle.flush(FsyncModeFileWriteAheadLogManager.java:2572)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileWriteHandle.flushOrWait(FsyncModeFileWriteAheadLogManager.java:2525)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileWriteHandle.close(FsyncModeFileWriteAheadLogManager.java:2795)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileWriteHandle.access$700(FsyncModeFileWriteAheadLogManager.java:2340)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager.rollOver(FsyncModeFileWriteAheadLogManager.java:1029)
>   at 
> org.apache.ignite.internal.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager.log(FsyncModeFileWriteAheadLogManager.java:673)
> {noformat}
> But the test seems unable to stop: the node stopper thread tries to stop the 
> cache and flush the WAL, and the flush waits for a rollover that will never happen.
> {noformat}
> Thread [name="node-stopper", id=2836, state=WAITING, blockCnt=7, waitCnt=9]
> Lock 
> [object=java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject@47f6473,
>  ownerName=null, ownerId=-1]
> at sun.misc.Unsafe.park(Native Method)
> at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
> at o.a.i.i.util.IgniteUtils.awaitQuiet(IgniteUtils.java:7473)
> at 
> o.a.i.i.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileWriteHandle.flushOrWait(FsyncModeFileWriteAheadLogManager.java:2546)
> at 
> o.a.i.i.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileWriteHandle.fsync(FsyncModeFileWriteAheadLogManager.java:2750)
> at 
> o.a.i.i.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager$FileWriteHandle.access$2000(FsyncModeFileWriteAheadLogManager.java:2340)
> at 
> o.a.i.i.processors.cache.persistence.wal.FsyncModeFileWriteAheadLogManager.flush(FsyncModeFileWriteAheadLogManager.java:699)
> at 
> o.a.i.i.processors.cache.GridCacheProcessor.stopCache(GridCacheProcessor.java:1243)
> at 
> o.a.i.i.processors.cache.GridCacheProcessor.stopCaches(GridCacheProcessor.java:969)
> at 
> o.a.i.i.processors.cache.GridCacheProcessor.stop(GridCacheProcessor.java:943)
> at o.a.i.i.IgniteKernal.stop0(IgniteKernal.java:2289)
> at o.a.i.i.IgniteKernal.stop(IgniteKernal.java:2167)
> at o.a.i.i.IgnitionEx$IgniteNamedInstance.stop0(IgnitionEx.java:2588)
> - locked o.a.i.i.IgnitionEx$IgniteNamedInstance@90f6bfd
> at o.a.i.i.IgnitionEx$Igni

[jira] [Commented] (IGNITE-8311) IgniteClientRejoinTest.testClientsReconnectDisabled causes exchange-worker to terminate via NPE

2018-06-04 Thread Dmitriy Sorokin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16500129#comment-16500129
 ] 

Dmitriy Sorokin commented on IGNITE-8311:
-

[~agura], review my patch, please!

> IgniteClientRejoinTest.testClientsReconnectDisabled causes exchange-worker to 
> terminate via NPE
> ---
>
> Key: IGNITE-8311
> URL: https://issues.apache.org/jira/browse/IGNITE-8311
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Andrey Kuznetsov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.6
>
>
> Currently, tests use {{NoOpFailureHandler}} by default, hence this 
> exchange-worker termination is masked. We are to fix it: test code should not 
> be able to terminate a system-critical thread.





[jira] [Comment Edited] (IGNITE-8311) IgniteClientRejoinTest.testClientsReconnectDisabled causes exchange-worker to terminate via NPE

2018-05-31 Thread Dmitriy Sorokin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496627#comment-16496627
 ] 

Dmitriy Sorokin edited comment on IGNITE-8311 at 5/31/18 2:26 PM:
--

The root cause of this error is an inconsistent state of 
GridAffinityAssignmentCache, which arises when the main loop of 
ExchangeWorker.body0() continues working after an IgniteCheckedException occurs 
(the last catch block at the end of the loop).

Proposed solution: rethrow the caught exception, which will prevent the grid 
components from entering an inconsistent state.

However, some tests that start a grid with an incorrect configuration will then 
halt the JVM because a critical system error is detected; that issue should be 
fixed in IGNITE-1094.


was (Author: cyberdemon):
The root cause of this error is an inconsistent state of 
GridAffinityAssignmentCache, which arises when the main loop of 
ExchangeWorker.body0() continues working after an IgniteCheckedException occurs 
(the last catch block at the end of the loop).

Proposed solution: rethrow the caught exception, which will prevent the grid 
components from entering an inconsistent state.

However, some tests that start a grid with an incorrect configuration will then 
halt the JVM because a critical system error is detected; that issue should be 
fixed in -IGNITE-1094-.

> IgniteClientRejoinTest.testClientsReconnectDisabled causes exchange-worker to 
> terminate via NPE
> ---
>
> Key: IGNITE-8311
> URL: https://issues.apache.org/jira/browse/IGNITE-8311
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Andrey Kuznetsov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.6
>
>
> Currently, tests use {{NoOpFailureHandler}} by default, hence this 
> exchange-worker termination is masked. We are to fix it: test code should not 
> be able to terminate a system-critical thread.





[jira] [Comment Edited] (IGNITE-8311) IgniteClientRejoinTest.testClientsReconnectDisabled causes exchange-worker to terminate via NPE

2018-05-31 Thread Dmitriy Sorokin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496627#comment-16496627
 ] 

Dmitriy Sorokin edited comment on IGNITE-8311 at 5/31/18 2:25 PM:
--

The root cause of this error is an inconsistent state of 
GridAffinityAssignmentCache, which arises when the main loop of 
ExchangeWorker.body0() continues working after an IgniteCheckedException occurs 
(the last catch block at the end of the loop).

Proposed solution: rethrow the caught exception, which will prevent the grid 
components from entering an inconsistent state.

However, some tests that start a grid with an incorrect configuration will then 
halt the JVM because a critical system error is detected; that issue should be 
fixed in -IGNITE-1094-.


was (Author: cyberdemon):
The root cause of this error is an inconsistent state of 
GridAffinityAssignmentCache, which arises when the main loop of 
ExchangeWorker.body0() continues working after an IgniteCheckedException occurs 
(the last catch block at the end of the loop).

Proposed solution: rethrow the caught exception, which will prevent the grid 
components from entering an inconsistent state.

However, some tests that start a grid with an incorrect configuration will then 
halt the JVM because a critical system error is detected; that issue should be 
fixed in IGNITE-1049.

> IgniteClientRejoinTest.testClientsReconnectDisabled causes exchange-worker to 
> terminate via NPE
> ---
>
> Key: IGNITE-8311
> URL: https://issues.apache.org/jira/browse/IGNITE-8311
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Andrey Kuznetsov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.6
>
>
> Currently, tests use {{NoOpFailureHandler}} by default, hence this 
> exchange-worker termination is masked. We are to fix it: test code should not 
> be able to terminate a system-critical thread.





[jira] [Commented] (IGNITE-8311) IgniteClientRejoinTest.testClientsReconnectDisabled causes exchange-worker to terminate via NPE

2018-05-31 Thread Dmitriy Sorokin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496627#comment-16496627
 ] 

Dmitriy Sorokin commented on IGNITE-8311:
-

The root cause of this error is an inconsistent state of 
GridAffinityAssignmentCache, which arises when the main loop of 
ExchangeWorker.body0() continues working after an IgniteCheckedException occurs 
(the last catch block at the end of the loop).

Proposed solution: rethrow the caught exception, which will prevent the grid 
components from entering an inconsistent state.

However, some tests that start a grid with an incorrect configuration will then 
halt the JVM because a critical system error is detected; that issue should be 
fixed in IGNITE-1049.
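
The shape of the proposed fix can be sketched roughly as follows. This is not the ExchangeWorker code: TaskException, Task and ExchangeLoop are hypothetical names made up for this illustration, with TaskException standing in for IgniteCheckedException. The sketch only shows a processing loop whose last catch block rethrows instead of continuing.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical checked exception standing in for IgniteCheckedException. */
class TaskException extends Exception {
    TaskException(String msg) {
        super(msg);
    }
}

/** Hypothetical unit of work processed by the loop. */
interface Task {
    void run() throws TaskException;
}

/** Hypothetical worker; only the catch-and-rethrow shape mirrors the comment above. */
class ExchangeLoop {
    private final Queue<Task> tasks = new ArrayDeque<>();

    /** Number of tasks that completed successfully. */
    int processed;

    void add(Task t) {
        tasks.add(t);
    }

    /** Processes tasks in order and stops at the first failure. */
    void body() throws TaskException {
        Task t;

        while ((t = tasks.poll()) != null) {
            try {
                t.run();
                processed++;
            }
            catch (TaskException e) {
                // Swallowing the exception here would let the loop continue with
                // possibly inconsistent state, so it is propagated to the caller.
                throw e;
            }
        }
    }
}
```

Tasks queued after the failing one are never run, which is the intended effect: the loop does not carry on once its state may be inconsistent.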

> IgniteClientRejoinTest.testClientsReconnectDisabled causes exchange-worker to 
> terminate via NPE
> ---
>
> Key: IGNITE-8311
> URL: https://issues.apache.org/jira/browse/IGNITE-8311
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Andrey Kuznetsov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.6
>
>
> Currently, tests use {{NoOpFailureHandler}} by default, hence this 
> exchange-worker termination is masked. We are to fix it: test code should not 
> be able to terminate a system-critical thread.





[jira] [Resolved] (IGNITE-5997) [Test Failed] DynamicIndexPartitionedAtomicConcurrentSelfTest.testCoordinatorChange

2018-05-31 Thread Dmitriy Sorokin (JIRA)


 [ 
https://issues.apache.org/jira/browse/IGNITE-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin resolved IGNITE-5997.
-
Resolution: Cannot Reproduce

> [Test Failed] 
> DynamicIndexPartitionedAtomicConcurrentSelfTest.testCoordinatorChange
> ---
>
> Key: IGNITE-5997
> URL: https://issues.apache.org/jira/browse/IGNITE-5997
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.1
>Reporter: Eduard Shangareev
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
>
> It fails more often locally on a Linux machine
> http://ci.ignite.apache.org/viewLog.html?buildId=752869&tab=buildResultsDiv&buildTypeId=Ignite20Tests_IgniteQueries2#testNameId-4226597044755906475
> {code}
> SchemaOperationException [code=0, msg=Client node is disconnected (operation 
> result is unknown).]
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor.onDisconnected(GridQueryProcessor.java:822)
>   at 
> org.apache.ignite.internal.IgniteKernal.onDisconnected(IgniteKernal.java:3770)
>   at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery0(GridDiscoveryManager.java:749)
>   at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery(GridDiscoveryManager.java:559)
>   at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.notifyDiscovery(ClientImpl.java:2391)
>   at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.notifyDiscovery(ClientImpl.java:2370)
>   at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1686)
>   at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {code}





[jira] [Commented] (IGNITE-5997) [Test Failed] DynamicIndexPartitionedAtomicConcurrentSelfTest.testCoordinatorChange

2018-05-31 Thread Dmitriy Sorokin (JIRA)


[ 
https://issues.apache.org/jira/browse/IGNITE-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16496455#comment-16496455
 ] 

Dmitriy Sorokin commented on IGNITE-5997:
-

I ran this test about 40 times, and no failures occurred.
In addition, [TC history of this 
test|https://ci.ignite.apache.org/project.html?tab=testDetails&projectId=IgniteTests24Java8&testNameId=-4226597044755906475&page=1]
 does not contain any run in which it failed.
So, I think that this ticket should be closed as non-reproducible.

> [Test Failed] 
> DynamicIndexPartitionedAtomicConcurrentSelfTest.testCoordinatorChange
> ---
>
> Key: IGNITE-5997
> URL: https://issues.apache.org/jira/browse/IGNITE-5997
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.1
>Reporter: Eduard Shangareev
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
>
> It fails more often locally on a Linux machine
> http://ci.ignite.apache.org/viewLog.html?buildId=752869&tab=buildResultsDiv&buildTypeId=Ignite20Tests_IgniteQueries2#testNameId-4226597044755906475
> {code}
> SchemaOperationException [code=0, msg=Client node is disconnected (operation 
> result is unknown).]
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor.onDisconnected(GridQueryProcessor.java:822)
>   at 
> org.apache.ignite.internal.IgniteKernal.onDisconnected(IgniteKernal.java:3770)
>   at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery0(GridDiscoveryManager.java:749)
>   at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery(GridDiscoveryManager.java:559)
>   at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.notifyDiscovery(ClientImpl.java:2391)
>   at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.notifyDiscovery(ClientImpl.java:2370)
>   at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1686)
>   at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {code}





[jira] [Commented] (IGNITE-8584) Provide ability to terminate any thread with enabled test features

2018-05-23 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-8584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487807#comment-16487807
 ] 

Dmitriy Sorokin commented on IGNITE-8584:
-

[~agura], review my patch, please!

> Provide ability to terminate any thread with enabled test features
> --
>
> Key: IGNITE-8584
> URL: https://issues.apache.org/jira/browse/IGNITE-8584
> Project: Ignite
>  Issue Type: New Feature
>Reporter: Andrey Gura
>Assignee: Dmitriy Sorokin
>Priority: Major
> Fix For: 2.6
>
>
> We already have {{WorkersControlMXBean}}, which makes it possible to 
> interrupt a thread registered in the system workers registry. We also want to stop any 
> thread in the system for testing purposes.
> A method {{stop(String threadName)}} should be added that finds a thread 
> by name and stops it.
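
One way to look up a thread by name is sketched below. This is not the WorkersControlMXBean implementation: ThreadStopper is a hypothetical helper, and the sketch assumes that "stopping" means interrupting the thread, because Thread.stop() is deprecated in the JDK. The actual patch may do something else.

```java
import java.util.Set;

/** Hypothetical helper; NOT the WorkersControlMXBean implementation. */
class ThreadStopper {
    /**
     * Interrupts the first live thread with the given name.
     *
     * @param threadName Name of the thread to look for.
     * @return true if a matching thread was found and interrupted.
     */
    static boolean stop(String threadName) {
        // getAllStackTraces() is keyed by every live thread in the JVM.
        Set<Thread> threads = Thread.getAllStackTraces().keySet();

        for (Thread t : threads) {
            if (t.getName().equals(threadName)) {
                // Assumption of this sketch: interrupt rather than Thread.stop().
                t.interrupt();

                return true;
            }
        }

        return false;
    }
}
```

Thread names are not unique, so this sketch acts on whichever matching thread it encounters first; a real implementation would need to decide how to treat duplicates.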





[jira] [Assigned] (IGNITE-8584) Provide ability to terminate any thread with enabled test features

2018-05-23 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-8584?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-8584:
---

Assignee: Dmitriy Sorokin

> Provide ability to terminate any thread with enabled test features
> --
>
> Key: IGNITE-8584
> URL: https://issues.apache.org/jira/browse/IGNITE-8584
> Project: Ignite
>  Issue Type: New Feature
>Reporter: Andrey Gura
>Assignee: Dmitriy Sorokin
>Priority: Major
> Fix For: 2.6
>
>
> We already have {{WorkersControlMXBean}}, which makes it possible to 
> interrupt a thread registered in the system workers registry. We also want to stop any 
> thread in the system for testing purposes.
> A method {{stop(String threadName)}} should be added that finds a thread 
> by name and stops it.





[jira] [Assigned] (IGNITE-5997) [Test Failed] DynamicIndexPartitionedAtomicConcurrentSelfTest.testCoordinatorChange

2018-05-14 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-5997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-5997:
---

Assignee: Dmitriy Sorokin

> [Test Failed] 
> DynamicIndexPartitionedAtomicConcurrentSelfTest.testCoordinatorChange
> ---
>
> Key: IGNITE-5997
> URL: https://issues.apache.org/jira/browse/IGNITE-5997
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.1
>Reporter: Eduard Shangareev
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
>
> It fails more often locally on a Linux machine
> http://ci.ignite.apache.org/viewLog.html?buildId=752869&tab=buildResultsDiv&buildTypeId=Ignite20Tests_IgniteQueries2#testNameId-4226597044755906475
> {code}
> SchemaOperationException [code=0, msg=Client node is disconnected (operation 
> result is unknown).]
>   at 
> org.apache.ignite.internal.processors.query.GridQueryProcessor.onDisconnected(GridQueryProcessor.java:822)
>   at 
> org.apache.ignite.internal.IgniteKernal.onDisconnected(IgniteKernal.java:3770)
>   at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery0(GridDiscoveryManager.java:749)
>   at 
> org.apache.ignite.internal.managers.discovery.GridDiscoveryManager$4.onDiscovery(GridDiscoveryManager.java:559)
>   at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.notifyDiscovery(ClientImpl.java:2391)
>   at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.notifyDiscovery(ClientImpl.java:2370)
>   at 
> org.apache.ignite.spi.discovery.tcp.ClientImpl$MessageWorker.body(ClientImpl.java:1686)
>   at org.apache.ignite.spi.IgniteSpiThread.run(IgniteSpiThread.java:62)
> {code}





[jira] [Assigned] (IGNITE-8311) IgniteClientRejoinTest.testClientsReconnectDisabled causes exchange-worker to terminate via NPE

2018-04-26 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-8311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-8311:
---

Assignee: Dmitriy Sorokin

> IgniteClientRejoinTest.testClientsReconnectDisabled causes exchange-worker to 
> terminate via NPE
> ---
>
> Key: IGNITE-8311
> URL: https://issues.apache.org/jira/browse/IGNITE-8311
> Project: Ignite
>  Issue Type: Bug
>Affects Versions: 2.4
>Reporter: Andrey Kuznetsov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: MakeTeamcityGreenAgain
> Fix For: 2.6
>
>
> Currently, tests use {{NoOpFailureHandler}} by default, hence this 
> exchange-worker termination is masked. We are to fix it: test code should not 
> be able to terminate a system-critical thread.





[jira] [Commented] (IGNITE-4958) Make data pages recyclable into index/meta/etc pages and vice versa

2018-04-25 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-4958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16452571#comment-16452571
 ] 

Dmitriy Sorokin commented on IGNITE-4958:
-

[~agura], [~ivan.glukos], review my patch, please; the test results seem good to 
me.

> Make data pages recyclable into index/meta/etc pages and vice versa
> ---
>
> Key: IGNITE-4958
> URL: https://issues.apache.org/jira/browse/IGNITE-4958
> Project: Ignite
>  Issue Type: Improvement
>  Components: cache
>Affects Versions: 2.0
>Reporter: Ivan Rakov
>Assignee: Dmitriy Sorokin
>Priority: Major
> Fix For: 2.6
>
>
> Recycling for data pages is disabled for now. Empty data pages are 
> accumulated in FreeListImpl#emptyDataPagesBucket, and can be reused only as 
> data pages again. What has to be done:
> * Empty data pages should be recycled into reuse bucket
> * We should check reuse bucket first before allocating a new data page
> * MemoryPolicyConfiguration#emptyPagesPoolSize should be removed
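
The first two items in the list above can be pictured with a small sketch. PageAllocator is a hypothetical class written only for illustration; it is unrelated to FreeListImpl and models pages as plain numeric ids. Emptied pages go into a shared reuse bucket, and allocation looks in that bucket before creating a new page.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical page allocator; names do not correspond to Ignite classes. */
class PageAllocator {
    /** Ids of emptied pages that may be reused for any page type. */
    private final Deque<Long> reuseBucket = new ArrayDeque<>();

    /** Id that the next freshly allocated page will receive. */
    private long nextPageId;

    /** Returns an emptied page to the shared reuse bucket. */
    void recycle(long pageId) {
        reuseBucket.push(pageId);
    }

    /** Takes a page from the reuse bucket if one is available, otherwise allocates a new one. */
    long allocate() {
        Long recycled = reuseBucket.poll();

        return recycled != null ? recycled : nextPageId++;
    }
}
```

Because the bucket is checked first, the number of distinct pages grows only when no recycled page is available.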





[jira] [Created] (IGNITE-8255) Possible name collisions in WorkersRegistry

2018-04-13 Thread Dmitriy Sorokin (JIRA)
Dmitriy Sorokin created IGNITE-8255:
---

 Summary: Possible name collisions in WorkersRegistry
 Key: IGNITE-8255
 URL: https://issues.apache.org/jira/browse/IGNITE-8255
 Project: Ignite
  Issue Type: Bug
Reporter: Dmitriy Sorokin
Assignee: Dmitriy Sorokin
 Fix For: 2.5


 
{code:java}
java.lang.IllegalStateException: Worker is already registered 
[worker=GridWorker [name=ttl-cleanup-worker, igniteInstanceName=null, 
finished=false, hashCode=612569625, interrupted=true, 
runner=ttl-cleanup-worker-#66]]
at 
org.apache.ignite.internal.worker.WorkersRegistry.register(WorkersRegistry.java:40)
at 
org.apache.ignite.internal.worker.WorkersRegistry.onStarted(WorkersRegistry.java:73)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:108)
at java.lang.Thread.run(Thread.java:748){code}
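
The ticket gives only the stack trace, so the following is a guess at the mechanism rather than a description of WorkersRegistry. Assuming a registry keyed by worker name, a second worker that registers under a name still held by an earlier one would fail in the way shown above. NameKeyedRegistry is a hypothetical class used only for this illustration.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical registry keyed by worker name; NOT the Ignite WorkersRegistry. */
class NameKeyedRegistry {
    private final ConcurrentMap<String, Object> workers = new ConcurrentHashMap<>();

    /** Registers a worker; fails if the name is already taken. */
    void register(String name, Object worker) {
        if (workers.putIfAbsent(name, worker) != null)
            throw new IllegalStateException("Worker is already registered [name=" + name + ']');
    }

    /** Removes the worker registered under the given name, if any. */
    void unregister(String name) {
        workers.remove(name);
    }
}
```

Under this assumption, a collision occurs whenever a new worker reuses a name before the previous holder has been unregistered.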
 





[jira] [Commented] (IGNITE-8101) Ability to terminate system workers by JMX for test purposes

2018-04-05 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-8101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16426833#comment-16426833
 ] 

Dmitriy Sorokin commented on IGNITE-8101:
-

[~agura], review my patch, please!

> Ability to terminate system workers by JMX for test purposes
> 
>
> Key: IGNITE-8101
> URL: https://issues.apache.org/jira/browse/IGNITE-8101
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Dmitriy Sorokin
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-14
> Fix For: 2.5
>
>






[jira] [Updated] (IGNITE-8101) Ability to terminate system workers by JMX for test purposes

2018-04-05 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-8101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin updated IGNITE-8101:

Description: (was: [0] disco-event-worker [0] 
 [1] grid-timeout-worker [1] 
 [2] partition-exchanger [2])

> Ability to terminate system workers by JMX for test purposes
> 
>
> Key: IGNITE-8101
> URL: https://issues.apache.org/jira/browse/IGNITE-8101
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Dmitriy Sorokin
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-14
> Fix For: 2.5
>
>






[jira] [Updated] (IGNITE-8101) Ability to terminate system workers by JMX for test purposes

2018-04-05 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-8101?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin updated IGNITE-8101:

Description: 
[0] disco-event-worker [0] 
 [1] grid-timeout-worker [1] 
 [2] partition-exchanger [2]

> Ability to terminate system workers by JMX for test purposes
> 
>
> Key: IGNITE-8101
> URL: https://issues.apache.org/jira/browse/IGNITE-8101
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Dmitriy Sorokin
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-14
> Fix For: 2.5
>
>
> [0] disco-event-worker [0] 
>  [1] grid-timeout-worker [1] 
>  [2] partition-exchanger [2]





[jira] [Commented] (IGNITE-8071) Add tests for failure handlers

2018-04-02 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-8071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16422602#comment-16422602
 ] 

Dmitriy Sorokin commented on IGNITE-8071:
-

[~agura], review my patch, please!

> Add tests for failure handlers
> --
>
> Key: IGNITE-8071
> URL: https://issues.apache.org/jira/browse/IGNITE-8071
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Andrey Gura
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-14
> Fix For: 2.5
>
>
> Different failure handlers were implemented due to IEP-14 (IGNITE-6890). 
> Tests should be added for these implementations.





[jira] [Created] (IGNITE-8101) Ability to terminate system workers by JMX for test purposes

2018-04-02 Thread Dmitriy Sorokin (JIRA)
Dmitriy Sorokin created IGNITE-8101:
---

 Summary: Ability to terminate system workers by JMX for test purposes
 Key: IGNITE-8101
 URL: https://issues.apache.org/jira/browse/IGNITE-8101
 Project: Ignite
  Issue Type: Improvement
Reporter: Dmitriy Sorokin
Assignee: Dmitriy Sorokin
 Fix For: 2.5








[jira] [Updated] (IGNITE-8068) Some tests failed due to JVM halt by default FailureHandler

2018-03-28 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin updated IGNITE-8068:

Fix Version/s: 2.5

> Some tests failed due to JVM halt by default FailureHandler
> ---
>
> Key: IGNITE-8068
> URL: https://issues.apache.org/jira/browse/IGNITE-8068
> Project: Ignite
>  Issue Type: Bug
>Reporter: Dmitriy Sorokin
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-14
> Fix For: 2.5
>
>
> NoOpFailureHandler is needed by default in test IgniteConfiguration instances.





[jira] [Updated] (IGNITE-8068) Some tests failed due to JVM halt by default FailureHandler

2018-03-28 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-8068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin updated IGNITE-8068:

Labels: iep-14  (was: )

> Some tests failed due to JVM halt by default FailureHandler
> ---
>
> Key: IGNITE-8068
> URL: https://issues.apache.org/jira/browse/IGNITE-8068
> Project: Ignite
>  Issue Type: Bug
>Reporter: Dmitriy Sorokin
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-14
> Fix For: 2.5
>
>
> NoOpFailureHandler is needed by default in test IgniteConfiguration instances.





[jira] [Created] (IGNITE-8068) Some tests failed due to JVM halt by default FailureHandler

2018-03-28 Thread Dmitriy Sorokin (JIRA)
Dmitriy Sorokin created IGNITE-8068:
---

 Summary: Some tests failed due to JVM halt by default 
FailureHandler
 Key: IGNITE-8068
 URL: https://issues.apache.org/jira/browse/IGNITE-8068
 Project: Ignite
  Issue Type: Bug
Reporter: Dmitriy Sorokin
Assignee: Dmitriy Sorokin


NoOpFailureHandler is needed by default in test IgniteConfiguration instances.





[jira] [Commented] (IGNITE-6890) General way for handling Ignite failures

2018-02-12 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360603#comment-16360603
 ] 

Dmitriy Sorokin commented on IGNITE-6890:
-

[~avinogradov], review my new patch, please.

> General way for handling Ignite failures
> 
>
> Key: IGNITE-6890
> URL: https://issues.apache.org/jira/browse/IGNITE-6890
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-7
> Fix For: 2.5
>
>
> Ignite failures which should be handled are:
>  # Topology segmentation;
>  # Exchange worker stop;
>  # Persistence errors.
> Proper behavior should be selected according to result of calling 
> IgniteFailureHandler instance, custom implementation of which can be provided 
> in IgniteConfiguration. It can be node stop, restart or nothing.
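The selection logic described above can be sketched roughly as follows. This is only an illustration: `FailureType`, `FailureAction`, `FailureHandlerSketch` and `NodeSketch` are hypothetical stand-ins invented for this example, not the Ignite API, and the policy in `main` is an arbitrary assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical failure kinds, taken from the list in the issue description. */
enum FailureType { SEGMENTATION, EXCHANGE_WORKER_STOP, PERSISTENCE_ERROR }

/** Hypothetical reactions: stop the node, restart it, or do nothing. */
enum FailureAction { STOP, RESTART, NOOP }

/** Hypothetical handler contract: maps a failure to the action the node should take. */
interface FailureHandlerSketch {
    FailureAction onFailure(FailureType type, Throwable cause);
}

/** Stand-in for a node that consults its configured handler when a failure occurs. */
class NodeSketch {
    private final FailureHandlerSketch hnd;

    /** Records every decision so that the behavior can be inspected. */
    final List<String> log = new ArrayList<>();

    NodeSketch(FailureHandlerSketch hnd) {
        this.hnd = hnd;
    }

    FailureAction fail(FailureType type, Throwable cause) {
        FailureAction action = hnd.onFailure(type, cause);

        log.add(type + " -> " + action);

        return action;
    }
}

class FailureHandlerDemo {
    public static void main(String[] args) {
        // Assumed policy for the demo: restart on segmentation, stop otherwise.
        FailureHandlerSketch custom = (type, cause) ->
            type == FailureType.SEGMENTATION ? FailureAction.RESTART : FailureAction.STOP;

        NodeSketch node = new NodeSketch(custom);

        node.fail(FailureType.SEGMENTATION, new RuntimeException("segmented"));
        node.fail(FailureType.PERSISTENCE_ERROR, new RuntimeException("disk error"));

        for (String line : node.log)
            System.out.println(line);
    }
}
```

A handler that always returns `NOOP` would correspond to the "nothing" option mentioned in the description.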





[jira] [Resolved] (IGNITE-6891) Proper behavior on Persistence errors

2018-02-07 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin resolved IGNITE-6891.
-
Resolution: Duplicate

> Proper behavior on Persistence errors 
> --
>
> Key: IGNITE-6891
> URL: https://issues.apache.org/jira/browse/IGNITE-6891
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-7
> Fix For: 2.5
>
>
> The node should be stopped anyway; what we can provide is a user callback, 
> something like 'beforeNodeStop'.





[jira] [Updated] (IGNITE-6890) General way for handling Ignite failures

2018-02-07 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin updated IGNITE-6890:

Summary: General way for handling Ignite failures  (was: Proper behavior on 
ExchangeWorker exits with error )

> General way for handling Ignite failures
> 
>
> Key: IGNITE-6890
> URL: https://issues.apache.org/jira/browse/IGNITE-6890
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-7
> Fix For: 2.5
>
>
> Ignite failures which should be handled are:
>  # Topology segmentation;
>  # Exchange worker stop;
>  # Persistence errors.
> Proper behavior should be selected according to result of calling 
> IgniteFailureHandler instance, custom implementation of which can be provided 
> in IgniteConfiguration. It can be node stop, restart or nothing.





[jira] [Updated] (IGNITE-6890) Proper behavior on ExchangeWorker exits with error

2018-02-07 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin updated IGNITE-6890:

Description: 
Ignite failures which should be handled are:
 # Topology segmentation;
 # Exchange worker stop;
 # Persistence errors.

Proper behavior should be selected according to result of calling 
IgniteFailureHandler instance, custom implementation of which can be provided 
in IgniteConfiguration. It can be node stop, restart or nothing.

  was:Node should be stopped anyway, what we can provide is user callback, 
something like beforeNodeStop'.


> Proper behavior on ExchangeWorker exits with error 
> ---
>
> Key: IGNITE-6890
> URL: https://issues.apache.org/jira/browse/IGNITE-6890
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>Priority: Major
>  Labels: iep-7
> Fix For: 2.5
>
>
> Ignite failures which should be handled are:
>  # Topology segmentation;
>  # Exchange worker stop;
>  # Persistence errors.
> Proper behavior should be selected according to result of calling 
> IgniteFailureHandler instance, custom implementation of which can be provided 
> in IgniteConfiguration. It can be node stop, restart or nothing.





[jira] [Commented] (IGNITE-7019) Cluster can not survive after IgniteOOM

2018-01-29 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16343239#comment-16343239
 ] 

Dmitriy Sorokin commented on IGNITE-7019:
-

[~avinogradov],
Review my patch, please.

> Cluster can not survive after IgniteOOM
> ---
>
> Key: IGNITE-7019
> URL: https://issues.apache.org/jira/browse/IGNITE-7019
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.3
>Reporter: Mikhail Cherkasov
>Assignee: Dmitriy Sorokin
>Priority: Critical
>  Labels: iep-7
> Fix For: 2.5
>
>
> Even if we have full sync mode and a transactional cache, we can't add new nodes 
> if there was an IgniteOOM: after adding new nodes and rebalancing, old nodes 
> can't evict partitions:
> {code}
> [2017-11-17 20:02:24,588][ERROR][sys-#65%DR1%][GridDhtPreloader] Partition 
> eviction failed, this can cause grid hang.
> class org.apache.ignite.internal.mem.IgniteOutOfMemoryException: Not enough 
> memory allocated [policyName=100MB_Region_Eviction, size=104.9 MB]
> Consider increasing memory policy size, enabling evictions, adding more nodes 
> to the cluster, reducing number of backups or reducing model size.
> at 
> org.apache.ignite.internal.pagemem.impl.PageMemoryNoStoreImpl.allocatePage(PageMemoryNoStoreImpl.java:294)
> at 
> org.apache.ignite.internal.processors.cache.persistence.DataStructure.allocatePageNoReuse(DataStructure.java:117)
> at 
> org.apache.ignite.internal.processors.cache.persistence.DataStructure.allocatePage(DataStructure.java:105)
> at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.addStripe(PagesList.java:413)
> at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.getPageForPut(PagesList.java:528)
> at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.PagesList.put(PagesList.java:617)
> at 
> org.apache.ignite.internal.processors.cache.persistence.freelist.FreeListImpl.addForRecycle(FreeListImpl.java:582)
> at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Remove.reuseFreePages(BPlusTree.java:3847)
> at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Remove.releaseAll(BPlusTree.java:4106)
> at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$Remove.access$6900(BPlusTree.java:3166)
> at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doRemove(BPlusTree.java:1782)
> at 
> org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.remove(BPlusTree.java:1567)
> at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl$CacheDataStoreImpl.remove(IgniteCacheOffheapManagerImpl.java:1387)
> at 
> org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.remove(IgniteCacheOffheapManagerImpl.java:374)
> at 
> org.apache.ignite.internal.processors.cache.GridCacheMapEntry.removeValue(GridCacheMapEntry.java:3233)
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheEntry.clearInternal(GridDhtCacheEntry.java:588)
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.clearAll(GridDhtLocalPartition.java:892)
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition.tryEvict(GridDhtLocalPartition.java:750)
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:593)
> at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader$3.call(GridDhtPreloader.java:580)
> at 
> org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6639)
> at 
> org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
> at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:748)
> {code}
> Discussion on the dev list: 
> http://apache-ignite-developers.2346864.n4.nabble.com/How-properly-handle-IgniteOOM-td25288.html





[jira] [Commented] (IGNITE-7019) Cluster can not survive after IgniteOOM

2018-01-26 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16340947#comment-16340947
 ] 

Dmitriy Sorokin commented on IGNITE-7019:
-

The final solution, as coded, passes a ReuseBag instance as a parameter 
through PagesList's getPageForPut and addStripe methods to the allocatePage method. 
This allows pages from the ReuseBag to be used before trying to allocate new pages.
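The idea can be sketched in a few lines of Java. This is a simplified illustration, not the actual patch: the `*Sketch` classes below are hypothetical stand-ins that only borrow the method names mentioned above, and the real signatures in Ignite are different.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a bag of page IDs that are free to be reused. */
class ReuseBagSketch {
    private final Deque<Long> pages = new ArrayDeque<>();

    void add(long pageId) {
        pages.push(pageId);
    }

    /** Returns a reusable page ID, or -1 if the bag is empty. */
    long poll() {
        return pages.isEmpty() ? -1L : pages.pop();
    }
}

/** Hypothetical page memory with a hard capacity; allocation fails when it is full. */
class PageMemorySketch {
    private final int capacity;
    private int allocated;

    PageMemorySketch(int capacity) {
        this.capacity = capacity;
    }

    long allocateNew() {
        if (allocated >= capacity)
            throw new IllegalStateException("Out of memory: no pages left");

        return 1000L + allocated++;
    }
}

/** Simplified pages list: the bag is threaded through getPageForPut -> addStripe -> allocatePage. */
class PagesListSketch {
    private final PageMemorySketch mem;

    PagesListSketch(PageMemorySketch mem) {
        this.mem = mem;
    }

    long getPageForPut(ReuseBagSketch bag) {
        return addStripe(bag);
    }

    private long addStripe(ReuseBagSketch bag) {
        return allocatePage(bag);
    }

    private long allocatePage(ReuseBagSketch bag) {
        long reused = bag == null ? -1L : bag.poll();

        // Prefer a page from the bag; only then ask page memory for a new one.
        return reused != -1L ? reused : mem.allocateNew();
    }
}

class ReuseBagDemo {
    public static void main(String[] args) {
        // Capacity 0 models a memory region that is already exhausted.
        PagesListSketch list = new PagesListSketch(new PageMemorySketch(0));

        ReuseBagSketch bag = new ReuseBagSketch();
        bag.add(42L);

        // Served from the bag, so no new allocation is attempted.
        System.out.println(list.getPageForPut(bag));
    }
}
```

With an empty bag the same call falls through to `allocateNew()` and fails, which mirrors the situation in the stack trace quoted below.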

> Cluster can not survive after IgniteOOM
> ---
>
> Key: IGNITE-7019
> URL: https://issues.apache.org/jira/browse/IGNITE-7019
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.3
>Reporter: Mikhail Cherkasov
>Assignee: Dmitriy Sorokin
>Priority: Critical
>  Labels: iep-7
> Fix For: 2.5
>
>





[jira] [Commented] (IGNITE-7019) Cluster can not survive after IgniteOOM

2018-01-22 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16334085#comment-16334085
 ] 

Dmitriy Sorokin commented on IGNITE-7019:
-

We discussed possible solutions with [~mcherkasov] and [~avinogradov], and 
chose the following: first, when an IOOME occurs while moving a page from a bucket with 
a lower index to a higher one, we leave the page in the old bucket; second, we add 
a periodic task that looks up such pages (placed in the wrong buckets) and 
corrects their placement if possible (no IOOME on page move).

We also need a reproducer for this bug; I'll make that first.

> Cluster can not survive after IgniteOOM
> ---
>
> Key: IGNITE-7019
> URL: https://issues.apache.org/jira/browse/IGNITE-7019
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.3
>Reporter: Mikhail Cherkasov
>Assignee: Dmitriy Sorokin
>Priority: Critical
>  Labels: iep-7
> Fix For: 2.5
>
>




[jira] [Assigned] (IGNITE-7019) Cluster can not survive after IgniteOOM

2018-01-12 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-7019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-7019:
---

Assignee: Dmitriy Sorokin

> Cluster can not survive after IgniteOOM
> ---
>
> Key: IGNITE-7019
> URL: https://issues.apache.org/jira/browse/IGNITE-7019
> Project: Ignite
>  Issue Type: Bug
>  Components: cache
>Affects Versions: 2.3
>Reporter: Mikhail Cherkasov
>Assignee: Dmitriy Sorokin
>Priority: Critical
>  Labels: iep-7
> Fix For: 2.4
>
>



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (IGNITE-6742) Java 9: rework Cleaner usage in PlatformMemoryPool class

2017-12-26 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16303773#comment-16303773
 ] 

Dmitriy Sorokin commented on IGNITE-6742:
-

[~agura], review my patch, please. I think that passing the Ignite basic tests is 
enough for this task.

> Java 9: rework Cleaner usage in PlatformMemoryPool class
> 
>
> Key: IGNITE-6742
> URL: https://issues.apache.org/jira/browse/IGNITE-6742
> Project: Ignite
>  Issue Type: Task
>  Components: platforms
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
> Fix For: 2.4
>
>
> We attach a special cleaner to {{PlatformMemoryPool}} using the 
> {{sun.misc.Cleaner.create}} method. This way we ensure that thread-local 
> native memory (which is used to pass data between platform and Java) is 
> released properly. 
> We need to rework this API to a reflection-based approach, which works for both 
> Java 7/8 and Java 9.
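A reflection-based approach of this kind might look like the sketch below. It is an assumption about one possible shape, not the patch under review: `ReflectiveCleanerSketch` is a hypothetical class, and the choice to prefer the public `java.lang.ref.Cleaner` on Java 9+ and fall back to `sun.misc.Cleaner` on Java 7/8 is made here only for illustration.

```java
import java.lang.reflect.Method;

/** Sketch: registers a cleanup action without compile-time references to any Cleaner class. */
class ReflectiveCleanerSketch {
    /**
     * Registers {@code cleanup} to run once {@code referent} becomes unreachable.
     * The cleanup action must not hold a reference to the referent.
     *
     * @return Name of the Cleaner class that was used.
     */
    static String register(Object referent, Runnable cleanup) {
        try {
            Class<?> cls;

            try {
                // Java 9+: public API, Cleaner.create().register(Object, Runnable).
                cls = Class.forName("java.lang.ref.Cleaner");
            }
            catch (ClassNotFoundException ignored) {
                cls = null;
            }

            if (cls != null) {
                Object cleaner = cls.getMethod("create").invoke(null);
                Method reg = cls.getMethod("register", Object.class, Runnable.class);

                reg.invoke(cleaner, referent, cleanup);

                return cls.getName();
            }

            // Java 7/8: static factory sun.misc.Cleaner.create(Object, Runnable).
            Class<?> old = Class.forName("sun.misc.Cleaner");

            old.getMethod("create", Object.class, Runnable.class).invoke(null, referent, cleanup);

            return old.getName();
        }
        catch (ReflectiveOperationException e) {
            throw new IllegalStateException("No usable Cleaner implementation found", e);
        }
    }
}

class ReflectiveCleanerDemo {
    public static void main(String[] args) {
        String used = ReflectiveCleanerSketch.register(new Object(), new Runnable() {
            @Override public void run() {
                System.out.println("native memory released");
            }
        });

        System.out.println("Cleaner in use: " + used);
    }
}
```

A production version would probably look up the classes and methods once and reuse a single cleaner instance rather than creating one per registration.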





[jira] [Assigned] (IGNITE-6894) Hanged Tx monitoring

2017-12-25 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-6894:
---

Assignee: Dmitriy Sorokin

> Hanged Tx monitoring
> 
>
> Key: IGNITE-6894
> URL: https://issues.apache.org/jira/browse/IGNITE-6894
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7
> Fix For: 2.4
>
>
> Hanging Transactions not Related to Deadlock
> Description
> This situation can occur if the user explicitly demarcates the transaction (esp. 
> Pessimistic Repeatable Read) and, for example, calls a remote service (which 
> may be unresponsive) after acquiring some locks. All other transactions 
> depending on the same keys will hang.
> Detection and Solution
> This most likely cannot be resolved automatically other than by rolling back the TX on 
> timeout and releasing all the locks acquired so far. Such TXs can also be 
> rolled back from Web Console as described above.
> If a transaction has been rolled back on timeout or via the UI, then any further 
> action in the transaction, e.g. lock acquisition or a commit attempt, should 
> throw an exception.
> Report
> Web Console should provide the ability to roll back any transaction via the UI.
> Long-running transactions should be reported to the logs. A log record should 
> contain: near nodes, transaction IDs, cache names, keys (limited to several 
> tens), etc. (?).
> Also there should be a screen in Web Console that will list all ongoing 
> transactions in the cluster including the info as above.





[jira] [Assigned] (IGNITE-6895) TX deadlock monitoring

2017-12-25 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-6895:
---

Assignee: Dmitriy Sorokin

> TX deadlock monitoring
> --
>
> Key: IGNITE-6895
> URL: https://issues.apache.org/jira/browse/IGNITE-6895
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7
> Fix For: 2.4
>
>
> Deadlocks with Cache Transactions
> Description
> Deadlocks of this type are possible if the user locks 2 or more keys within 2 or 
> more transactions in different orders (this does not apply to OPTIMISTIC 
> SERIALIZABLE transactions, as they are capable of detecting a deadlock and choosing 
> the winning tx). Currently, Ignite can detect deadlocked transactions, but this 
> procedure is started only for transactions that have a timeout set explicitly 
> or when the default timeout in the configuration is set to a value greater than 0.
> Detection and Solution
> Each NEAR node should periodically (need new config property?) scan the list 
> of local transactions and initiate the same procedure as we have now for 
> timed out transactions. If a deadlock is found, it should be reported to the logs. A log 
> record should contain: near nodes, transaction IDs, cache names, keys 
> (limited to several tens) involved in the deadlock. The user should have the ability 
> to configure the default behavior - REPORT_ONLY, ROLLBACK (any more?) - or to manually 
> roll back a selected transaction through Web Console or Visor.
> Report
> If a deadlock is found, it should be reported to the logs. A log record should contain: 
> near nodes, transaction IDs, cache names, keys (limited to several tens) 
> involved in the deadlock.
> Also there should be a screen in Web Console that will list all ongoing 
> transactions in the cluster including the following info:
> - Near node
> - Start time
> - DHT nodes
> - Pending Locks (by request)
> Web Console should provide ability to rollback any transaction via UI.
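To make the "different orders" scenario concrete, here is a small wait-for graph sketch in Java. It is not Ignite's deadlock detection procedure: `WaitForGraphSketch` is a hypothetical class, and it assumes, for simplicity, that each transaction waits for at most one other transaction.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Sketch of a wait-for graph: an edge tx -> holder means tx waits for a lock held by holder. */
class WaitForGraphSketch {
    /** Simplifying assumption: every transaction has at most one pending lock. */
    private final Map<String, String> waitsFor = new HashMap<>();

    void addWait(String waiter, String holder) {
        waitsFor.put(waiter, holder);
    }

    /**
     * Follows wait-for edges starting from the given transaction.
     *
     * @return Transactions forming the cycle if {@code start} is part of one, otherwise an empty list.
     */
    List<String> findDeadlock(String start) {
        List<String> path = new ArrayList<>();
        Set<String> seen = new HashSet<>();

        String cur = start;

        while (cur != null && seen.add(cur)) {
            path.add(cur);

            cur = waitsFor.get(cur);
        }

        // The walk ended on the start transaction again: the path is a cycle.
        return start.equals(cur) ? path : Collections.<String>emptyList();
    }
}

class WaitForGraphDemo {
    public static void main(String[] args) {
        WaitForGraphSketch g = new WaitForGraphSketch();

        // tx1 holds key A and wants key B; tx2 holds key B and wants key A.
        g.addWait("tx1", "tx2");
        g.addWait("tx2", "tx1");

        // tx3 waits for tx1 but is not part of the cycle itself.
        g.addWait("tx3", "tx1");

        System.out.println("tx1: " + g.findDeadlock("tx1"));
        System.out.println("tx3: " + g.findDeadlock("tx3"));
    }
}
```

A periodic scan as proposed in the description could run such a check for each local transaction and then report or roll back according to the configured behavior.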





[jira] [Issue Comment Deleted] (IGNITE-6890) Proper behavior on ExchangeWorker exits with error

2017-12-19 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin updated IGNITE-6890:

Comment: was deleted

(was: [~alex_pl], review my last patch, please.
[Ignite 2.0 Tests :: Ignite 
Basic|https://ci.ignite.apache.org/viewType.html?buildTypeId=Ignite20Tests_IgniteBasic&tab=buildTypeStatusDiv&branch_Ignite20Tests=pull%2F3083%2Fhead])

> Proper behavior on ExchangeWorker exits with error 
> ---
>
> Key: IGNITE-6890
> URL: https://issues.apache.org/jira/browse/IGNITE-6890
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7
> Fix For: 2.4
>
>
> The node should be stopped anyway; what we can provide is a user callback, 
> something like 'beforeNodeStop'.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (IGNITE-6890) Proper behavior on ExchangeWorker exits with error

2017-12-19 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297049#comment-16297049
 ] 

Dmitriy Sorokin commented on IGNITE-6890:
-

[~alex_pl], review my last patch, please.
[Ignite 2.0 Tests :: Ignite 
Basic|https://ci.ignite.apache.org/viewType.html?buildTypeId=Ignite20Tests_IgniteBasic&tab=buildTypeStatusDiv&branch_Ignite20Tests=pull%2F3083%2Fhead]

> Proper behavior on ExchangeWorker exits with error 
> ---
>
> Key: IGNITE-6890
> URL: https://issues.apache.org/jira/browse/IGNITE-6890
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7
> Fix For: 2.4
>
>
> The node should be stopped anyway; what we can provide is a user callback, 
> something like 'beforeNodeStop'.





[jira] [Commented] (IGNITE-5302) Empty LOST partition may be used as OWNING after resetting lost partitions

2017-12-19 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296899#comment-16296899
 ] 

Dmitriy Sorokin commented on IGNITE-5302:
-

The latest research results for this ticket (on the head of branch ignite-5267):

Without grid activation after the nodes start (the last line in the code block below)
{code}
IgniteEx ignite1 = (IgniteEx)G.start(getConfiguration("test1"));
IgniteEx ignite2 = (IgniteEx)G.start(getConfiguration("test2"));
IgniteEx ignite3 = (IgniteEx)G.start(getConfiguration("test3"));
IgniteEx ignite4 = (IgniteEx)G.start(getConfiguration("test4"));

ignite1.active(true);
{code}
the test fails with the exception shown below:
{noformat}
class org.apache.ignite.IgniteException: Can not perform the operation because 
the cluster is inactive. Note, that the cluster is considered inactive by 
default if Ignite Persistent Store is used to let all the nodes join the 
cluster. To activate the cluster call Ignite.activate(true).

    at org.apache.ignite.internal.IgniteKernal.checkClusterState(IgniteKernal.java:3693)
    at org.apache.ignite.internal.IgniteKernal.cache(IgniteKernal.java:2713)
    at org.apache.ignite.internal.processors.cache.persistence.IgnitePdsCacheRebalancingAbstractTest.testPartitionLossAndRecover(IgnitePdsCacheRebalancingAbstractTest.java:336)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at junit.framework.TestCase.runTest(TestCase.java:176)
    at org.apache.ignite.testframework.junits.GridAbstractTest.runTestInternal(GridAbstractTest.java:1995)
    at org.apache.ignite.testframework.junits.GridAbstractTest.access$000(GridAbstractTest.java:132)
    at org.apache.ignite.testframework.junits.GridAbstractTest$5.run(GridAbstractTest.java:1910)
    at java.lang.Thread.run(Thread.java:745)
{noformat}

With grid activation, the test fails with an assertion error:
{noformat}
java.lang.AssertionError
    at org.apache.ignite.internal.processors.cache.persistence.IgnitePdsCacheRebalancingAbstractTest.testPartitionLossAndRecover(IgnitePdsCacheRebalancingAbstractTest.java:350)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at junit.framework.TestCase.runTest(TestCase.java:176)
    at org.apache.ignite.testframework.junits.GridAbstractTest.runTestInternal(GridAbstractTest.java:1995)
    at org.apache.ignite.testframework.junits.GridAbstractTest.access$000(GridAbstractTest.java:132)
    at org.apache.ignite.testframework.junits.GridAbstractTest$5.run(GridAbstractTest.java:1910)
    at java.lang.Thread.run(Thread.java:745)
{noformat}

The assertion error shown above occurs on this line of code:
{code}
assert !ignite1.cache(cacheName).lostPartitions().isEmpty();
{code}

So, we first need to produce lost partitions in our test; only then will we be 
able to fix the problem.

> Empty LOST partition may be used as OWNING after resetting lost partitions
> --
>
> Key: IGNITE-5302
> URL: https://issues.apache.org/jira/browse/IGNITE-5302
> Project: Ignite
>  Issue Type: Bug
>Reporter: Sergey Chugunov
>Priority: Blocker
>  Labels: MakeTeamcityGreenAgain, Muted_test, test-fail
> Fix For: 2.4
>
>
> h2. Notes
> Test *testPartitionLossAndRecover* reproducing the issue can be found in 
> ignite-5267 branch with PDS functionality.
> h2. Steps to reproduce
> # Four nodes are started, some key is added to partitioned cache
> # Primary and backup nodes for the key are stopped, key's partition is 
> declared LOST on remaining nodes
> # Primary and backup nodes are started again, cache's lost partitions are 
> reset
> # Key is requested from cache
> h2. Expected behavior
> Correct value is returned from primary for this partition
> h2. Actual behavior
> Request for value is sent to node where partition is empty (not to primary 
> node), null is returned
> h2. Latest findings
> # The main problem with the scenario is that request for key gets mapped not 
> only to P/B nodes with real value but also to the node where that partition 
> existed only in LOST state after P/B shutdown on step #2
> # It was found that on step #3 after primary and backup are joined partition 
> counter is increased for empty partition in LOST state

[jira] [Assigned] (IGNITE-5302) Empty LOST partition may be used as OWNING after resetting lost partitions

2017-12-18 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-5302:
---

Assignee: (was: Dmitriy Sorokin)

> Empty LOST partition may be used as OWNING after resetting lost partitions
> --
>
> Key: IGNITE-5302
> URL: https://issues.apache.org/jira/browse/IGNITE-5302
> Project: Ignite
>  Issue Type: Bug
>Reporter: Sergey Chugunov
>Priority: Blocker
>  Labels: MakeTeamcityGreenAgain, Muted_test, test-fail
> Fix For: 2.4
>
>
> h2. Notes
> Test *testPartitionLossAndRecover* reproducing the issue can be found in 
> ignite-5267 branch with PDS functionality.
> h2. Steps to reproduce
> # Four nodes are started, some key is added to partitioned cache
> # Primary and backup nodes for the key are stopped, key's partition is 
> declared LOST on remaining nodes
> # Primary and backup nodes are started again, cache's lost partitions are 
> reset
> # Key is requested from cache
> h2. Expected behavior
> Correct value is returned from primary for this partition
> h2. Actual behavior
> Request for value is sent to node where partition is empty (not to primary 
> node), null is returned
> h2. Latest findings
> # The main problem with the scenario is that request for key gets mapped not 
> only to P/B nodes with real value but also to the node where that partition 
> existed only in LOST state after P/B shutdown on step #2
> # It was found that on step #3 after primary and backup are joined partition 
> counter is increased for empty partition in LOST state which looks wrong





[jira] [Commented] (IGNITE-6890) Proper behavior on ExchangeWorker exits with error

2017-12-14 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290688#comment-16290688
 ] 

Dmitriy Sorokin commented on IGNITE-6890:
-

[~avinogradov], review my new patch, please. I think that passing the Ignite 
Basic test suite is enough for this patch.
[Ignite 2.0 Tests :: Ignite 
Basic|https://ci.ignite.apache.org/viewType.html?buildTypeId=Ignite20Tests_IgniteBasic&branch_Ignite20Tests=pull%2F3083%2Fhead&tab=buildTypeStatusDiv]

> Proper behavior on ExchangeWorker exits with error 
> ---
>
> Key: IGNITE-6890
> URL: https://issues.apache.org/jira/browse/IGNITE-6890
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7
> Fix For: 2.4
>
>
> The node should be stopped anyway; what we can provide is a user callback, 
> something like beforeNodeStop.





[jira] [Comment Edited] (IGNITE-6171) Native facility to control excessive GC pauses

2017-12-13 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287401#comment-16287401
 ] 

Dmitriy Sorokin edited comment on IGNITE-6171 at 12/13/17 1:22 PM:
---

[~avinogradov], review my new patch, please. I think that passing the Ignite 
Basic test suite is enough for this patch.
[Ignite 2.0 Tests :: Ignite 
Basic|https://ci.ignite.apache.org/viewType.html?buildTypeId=Ignite20Tests_IgniteBasic&branch_Ignite20Tests=pull%2F3076%2Fhead&tab=buildTypeStatusDiv]


was (Author: cyberdemon):
[~avinogradov], review my new patch, please. I think that passing the Ignite 
Basic test suite is enough for this patch.
[Ignite20Tests_IgniteBasic|https://ci.ignite.apache.org/viewType.html?buildTypeId=Ignite20Tests_IgniteBasic&branch_Ignite20Tests=pull%2F3076%2Fhead&tab=buildTypeStatusDiv]

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is Java-based application. If node experiences long GC pauses it may 
> negatively affect other nodes. We need to find a way to detect long GC pauses 
> within the process and trigger some actions in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you sleep 
> while sleeping. As all Java threads are blocked on safepoint, we cannot use 
> Java's thread to detect Java's GC. Native threads should be used instead.
> Proposed solution:
> 1) Thread 1 should periodically call dummy JNI method returning current time, 
> and set this time to shared variable;
> 2) Thread 2 should periodically check that variable. If it has not been 
> changed for some time - most likely we are in GC pause. Once certain 
> threshold is reached - trigger compensating action, whether this is a 
> warning, process kill, or anything else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if STW pause is in progress. Java method cannot 
> be empty, as JVM is smart enough and can deduce it to no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/
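A minimal sketch of the detection idea from the description above, under a stated simplification: instead of the JNI-based scheme the ticket proposes, a single Java thread sleeps for a short interval and treats a large overshoot of its wake-up time as a possible stop-the-world pause. Such a thread is itself stopped during the pause, so it can only report the pause after it has ended. The class name PauseDetectorSketch and both parameters are hypothetical and are not Ignite API.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch; not Ignite code and not the JNI scheme from the ticket. */
final class PauseDetectorSketch implements Runnable {
    /** How long the thread sleeps between checks, in ms. */
    private final long precisionMs;

    /** Minimum wake-up overshoot, in ms, that is reported as a pause. */
    private final long thresholdMs;

    PauseDetectorSketch(long precisionMs, long thresholdMs) {
        this.precisionMs = precisionMs;
        this.thresholdMs = thresholdMs;
    }

    /**
     * Returns the overshoot in ms if the time between two wake-ups exceeded
     * the sleep interval by at least the threshold, otherwise 0.
     */
    long detectedPauseMs(long prevWakeNanos, long nowNanos) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - prevWakeNanos);
        long overshootMs = elapsedMs - precisionMs;

        return overshootMs >= thresholdMs ? overshootMs : 0L;
    }

    /** Sleeps in a loop until interrupted and logs suspected pauses. */
    @Override public void run() {
        long prev = System.nanoTime();

        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(precisionMs);
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();

                return;
            }

            long now = System.nanoTime();
            long pauseMs = detectedPauseMs(prev, now);

            if (pauseMs > 0)
                System.err.println("Possible long JVM pause: " + pauseMs + " ms");

            prev = now;
        }
    }
}
```

The detector could be started as a daemon thread, for example with new Thread(new PauseDetectorSketch(50, 500), "pause-detector"). An overshoot may also come from OS scheduling delays, so a report is a hint of a GC pause rather than proof.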





[jira] [Comment Edited] (IGNITE-6171) Native facility to control excessive GC pauses

2017-12-13 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287401#comment-16287401
 ] 

Dmitriy Sorokin edited comment on IGNITE-6171 at 12/13/17 1:21 PM:
---

[~avinogradov], review my new patch, please. I think that passing the Ignite 
Basic test suite is enough for this patch.
[Ignite20Tests_IgniteBasic|https://ci.ignite.apache.org/viewType.html?buildTypeId=Ignite20Tests_IgniteBasic&branch_Ignite20Tests=pull%2F3076%2Fhead&tab=buildTypeStatusDiv]


was (Author: cyberdemon):
[~avinogradov], review my new patch, please. I think that passing the Ignite 
Basic test suite is enough for this patch.
[Ignite20Tests_IgniteBasic|https://ci.ignite.apache.org/viewLog.html?buildId=991128&tab=buildResultsDiv&buildTypeId=Ignite20Tests_IgniteBasic]

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is Java-based application. If node experiences long GC pauses it may 
> negatively affect other nodes. We need to find a way to detect long GC pauses 
> within the process and trigger some actions in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you sleep 
> while sleeping. As all Java threads are blocked on safepoint, we cannot use 
> Java's thread to detect Java's GC. Native threads should be used instead.
> Proposed solution:
> 1) Thread 1 should periodically call dummy JNI method returning current time, 
> and set this time to shared variable;
> 2) Thread 2 should periodically check that variable. If it has not been 
> changed for some time - most likely we are in GC pause. Once certain 
> threshold is reached - trigger compensating action, whether this is a 
> warning, process kill, or anything else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if STW pause is in progress. Java method cannot 
> be empty, as JVM is smart enough and can deduce it to no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Comment Edited] (IGNITE-6171) Native facility to control excessive GC pauses

2017-12-13 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287401#comment-16287401
 ] 

Dmitriy Sorokin edited comment on IGNITE-6171 at 12/13/17 1:18 PM:
---

[~avinogradov], review my new patch, please. I think that passing the Ignite 
Basic test suite is enough for this patch.
[Ignite20Tests_IgniteBasic|https://ci.ignite.apache.org/viewLog.html?buildId=991128&tab=buildResultsDiv&buildTypeId=Ignite20Tests_IgniteBasic]


was (Author: cyberdemon):
[~avinogradov], review my new patch, please. I think that passing the Ignite 
Basic test suite is enough for this patch.

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is Java-based application. If node experiences long GC pauses it may 
> negatively affect other nodes. We need to find a way to detect long GC pauses 
> within the process and trigger some actions in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you sleep 
> while sleeping. As all Java threads are blocked on safepoint, we cannot use 
> Java's thread to detect Java's GC. Native threads should be used instead.
> Proposed solution:
> 1) Thread 1 should periodically call dummy JNI method returning current time, 
> and set this time to shared variable;
> 2) Thread 2 should periodically check that variable. If it has not been 
> changed for some time - most likely we are in GC pause. Once certain 
> threshold is reached - trigger compensating action, whether this is a 
> warning, process kill, or anything else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if STW pause is in progress. Java method cannot 
> be empty, as JVM is smart enough and can deduce it to no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-12-12 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16287401#comment-16287401
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

[~avinogradov], review my new patch, please. I think that passing the Ignite 
Basic test suite is enough for this patch.

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is Java-based application. If node experiences long GC pauses it may 
> negatively affect other nodes. We need to find a way to detect long GC pauses 
> within the process and trigger some actions in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you sleep 
> while sleeping. As all Java threads are blocked on safepoint, we cannot use 
> Java's thread to detect Java's GC. Native threads should be used instead.
> Proposed solution:
> 1) Thread 1 should periodically call dummy JNI method returning current time, 
> and set this time to shared variable;
> 2) Thread 2 should periodically check that variable. If it has not been 
> changed for some time - most likely we are in GC pause. Once certain 
> threshold is reached - trigger compensating action, whether this is a 
> warning, process kill, or anything else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if STW pause is in progress. Java method cannot 
> be empty, as JVM is smart enough and can deduce it to no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Commented] (IGNITE-6890) Proper behavior on ExchangeWorker exits with error

2017-12-09 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284932#comment-16284932
 ] 

Dmitriy Sorokin commented on IGNITE-6890:
-

As proposed by [~avinogradov] in the discussion thread [Internal problems 
requiring graceful node shutdown, reboot, 
etc.|http://apache-ignite-developers.2346864.n4.nabble.com/Internal-problems-requiring-graceful-node-shutdown-reboot-etc-td24856.html],
 a scheme with IgniteFailureHandler and IgniteFailureAction will be implemented:
{code}
interface IgniteFailureHandler {
    IgniteFailureAction onFailure(IgniteFailureCause cause);
}

public enum IgniteFailureAction {
    RESTART_JVM,
    STOP,
    NOOP;
}
{code}

A default implementation of IgniteFailureHandler will be provided and enabled 
by default, and the ability to set a custom user implementation in 
IgniteConfiguration will be added.
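To illustrate how the proposed scheme might be used, here is a minimal sketch with a default and a custom handler. The interface and the action enum follow the proposal above; IgniteFailureCause is not defined in this comment, so its constants, the FailureHandlerSketch class, and the dispatch method are hypothetical and serve only as an illustration, not as the actual patch.

```java
/** Not defined in the proposal: these constants are hypothetical. */
enum IgniteFailureCause {
    EXCHANGE_WORKER_TERMINATED,
    PERSISTENCE_ERROR
}

/** Actions as listed in the proposal. */
enum IgniteFailureAction {
    RESTART_JVM,
    STOP,
    NOOP
}

/** Handler interface as listed in the proposal. */
interface IgniteFailureHandler {
    IgniteFailureAction onFailure(IgniteFailureCause cause);
}

/** Hypothetical illustration of a default handler, a custom handler, and a dispatcher. */
final class FailureHandlerSketch {
    /** Assumed default behavior: stop the node on any failure. */
    static final IgniteFailureHandler DEFAULT = cause -> IgniteFailureAction.STOP;

    /** Example of a user handler: restart the JVM on persistence errors only. */
    static final IgniteFailureHandler CUSTOM = cause ->
        cause == IgniteFailureCause.PERSISTENCE_ERROR
            ? IgniteFailureAction.RESTART_JVM
            : IgniteFailureAction.STOP;

    /** Asks the handler what to do and returns a description of the chosen action. */
    static String dispatch(IgniteFailureHandler hnd, IgniteFailureCause cause) {
        IgniteFailureAction action = hnd.onFailure(cause);

        switch (action) {
            case RESTART_JVM:
                return "restart JVM";

            case STOP:
                return "stop node";

            default:
                return "ignore";
        }
    }
}
```

Keeping the decision (the handler) separate from its execution (the dispatcher) is what would let a user implementation be plugged in through configuration without touching the code that stops or restarts the node.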

> Proper behavior on ExchangeWorker exits with error 
> ---
>
> Key: IGNITE-6890
> URL: https://issues.apache.org/jira/browse/IGNITE-6890
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7
> Fix For: 2.4
>
>
> The node should be stopped anyway; what we can provide is a user callback, 
> something like beforeNodeStop.





[jira] [Issue Comment Deleted] (IGNITE-6171) Native facility to control excessive GC pauses

2017-12-08 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin updated IGNITE-6171:

Comment: was deleted

(was: [~avinogradov], review new patch, please!)

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is Java-based application. If node experiences long GC pauses it may 
> negatively affect other nodes. We need to find a way to detect long GC pauses 
> within the process and trigger some actions in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you sleep 
> while sleeping. As all Java threads are blocked on safepoint, we cannot use 
> Java's thread to detect Java's GC. Native threads should be used instead.
> Proposed solution:
> 1) Thread 1 should periodically call dummy JNI method returning current time, 
> and set this time to shared variable;
> 2) Thread 2 should periodically check that variable. If it has not been 
> changed for some time - most likely we are in GC pause. Once certain 
> threshold is reached - trigger compensating action, whether this is a 
> warning, process kill, or anything else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if STW pause is in progress. Java method cannot 
> be empty, as JVM is smart enough and can deduce it to no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-12-07 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281641#comment-16281641
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

[~avinogradov], review new patch, please!

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is Java-based application. If node experiences long GC pauses it may 
> negatively affect other nodes. We need to find a way to detect long GC pauses 
> within the process and trigger some actions in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you sleep 
> while sleeping. As all Java threads are blocked on safepoint, we cannot use 
> Java's thread to detect Java's GC. Native threads should be used instead.
> Proposed solution:
> 1) Thread 1 should periodically call dummy JNI method returning current time, 
> and set this time to shared variable;
> 2) Thread 2 should periodically check that variable. If it has not been 
> changed for some time - most likely we are in GC pause. Once certain 
> threshold is reached - trigger compensating action, whether this is a 
> warning, process kill, or anything else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if STW pause is in progress. Java method cannot 
> be empty, as JVM is smart enough and can deduce it to no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Assigned] (IGNITE-6742) Java 9: rework Cleaner usage in PlatformMemoryPool class

2017-12-04 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-6742:
---

Assignee: Dmitriy Sorokin

> Java 9: rework Cleaner usage in PlatformMemoryPool class
> 
>
> Key: IGNITE-6742
> URL: https://issues.apache.org/jira/browse/IGNITE-6742
> Project: Ignite
>  Issue Type: Task
>  Components: platforms
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
> Fix For: 2.4
>
>
> We attach special cleaner to {{PlatformMemoryPool}} using 
> {{sun.misc.Cleaner.create}} method. This way we ensure that thread-local 
> native memory (which is used to pass data between platform and Java) is 
> released properly. 
> We need to rework this into a reflection-based approach that works for both 
> Java 7/8 and Java 9.
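A minimal sketch of the first step of such a reflection-based approach: resolving the cleaner class by name at runtime instead of referencing it at compile time. It relies on the internal cleaner living in sun.misc.Cleaner on Java 7/8 and in jdk.internal.ref.Cleaner on Java 9. The CleanerLookupSketch class and its methods are hypothetical, and the sketch deliberately stops at class lookup; invoking create reflectively on Java 9 may additionally require the internal package to be opened to the caller.

```java
/** Hypothetical sketch; not the PlatformMemoryPool code. */
final class CleanerLookupSketch {
    /** Returns the first class that can be loaded from the candidate names, or null. */
    static Class<?> firstAvailable(String... names) {
        ClassLoader ldr = CleanerLookupSketch.class.getClassLoader();

        for (String name : names) {
            try {
                // Load without initializing; we only need to know that it exists.
                return Class.forName(name, false, ldr);
            }
            catch (ClassNotFoundException | LinkageError ignored) {
                // Not present in this JDK, try the next candidate.
            }
        }

        return null;
    }

    /** Looks up the internal cleaner class under its Java 7/8 name, then its Java 9 name. */
    static Class<?> cleanerClass() {
        return firstAvailable("sun.misc.Cleaner", "jdk.internal.ref.Cleaner");
    }
}
```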





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-12-01 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16274600#comment-16274600
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

[~avinogradov], review new patch, please!

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is Java-based application. If node experiences long GC pauses it may 
> negatively affect other nodes. We need to find a way to detect long GC pauses 
> within the process and trigger some actions in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you sleep 
> while sleeping. As all Java threads are blocked on safepoint, we cannot use 
> Java's thread to detect Java's GC. Native threads should be used instead.
> Proposed solution:
> 1) Thread 1 should periodically call dummy JNI method returning current time, 
> and set this time to shared variable;
> 2) Thread 2 should periodically check that variable. If it has not been 
> changed for some time - most likely we are in GC pause. Once certain 
> threshold is reached - trigger compensating action, whether this is a 
> warning, process kill, or anything else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if STW pause is in progress. Java method cannot 
> be empty, as JVM is smart enough and can deduce it to no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Assigned] (IGNITE-6891) Proper behavior on Persistence errors

2017-11-24 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6891?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-6891:
---

Assignee: Dmitriy Sorokin

> Proper behavior on Persistence errors 
> --
>
> Key: IGNITE-6891
> URL: https://issues.apache.org/jira/browse/IGNITE-6891
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7
> Fix For: 2.4
>
>
> The node should be stopped anyway; what we can provide is a user callback, 
> something like beforeNodeStop.





[jira] [Comment Edited] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-23 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264634#comment-16264634
 ] 

Dmitriy Sorokin edited comment on IGNITE-6171 at 11/23/17 5:22 PM:
---

[~avinogradov], please review new patch.


was (Author: cyberdemon):
Anton Vinogradov, please review new patch.

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is Java-based application. If node experiences long GC pauses it may 
> negatively affect other nodes. We need to find a way to detect long GC pauses 
> within the process and trigger some actions in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you sleep 
> while sleeping. As all Java threads are blocked on safepoint, we cannot use 
> Java's thread to detect Java's GC. Native threads should be used instead.
> Proposed solution:
> 1) Thread 1 should periodically call dummy JNI method returning current time, 
> and set this time to shared variable;
> 2) Thread 2 should periodically check that variable. If it has not been 
> changed for some time - most likely we are in GC pause. Once certain 
> threshold is reached - trigger compensating action, whether this is a 
> warning, process kill, or anything else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if STW pause is in progress. Java method cannot 
> be empty, as JVM is smart enough and can deduce it to no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-23 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264634#comment-16264634
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

Anton Vinogradov, please review new patch.

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is Java-based application. If node experiences long GC pauses it may 
> negatively affect other nodes. We need to find a way to detect long GC pauses 
> within the process and trigger some actions in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you sleep 
> while sleeping. As all Java threads are blocked on safepoint, we cannot use 
> Java's thread to detect Java's GC. Native threads should be used instead.
> Proposed solution:
> 1) Thread 1 should periodically call dummy JNI method returning current time, 
> and set this time to shared variable;
> 2) Thread 2 should periodically check that variable. If it has not been 
> changed for some time - most likely we are in GC pause. Once certain 
> threshold is reached - trigger compensating action, whether this is a 
> warning, process kill, or anything else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if STW pause is in progress. Java method cannot 
> be empty, as JVM is smart enough and can deduce it to no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Commented] (IGNITE-6890) Proper behavior on ExchangeWorker exits with error

2017-11-23 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264407#comment-16264407
 ] 

Dmitriy Sorokin commented on IGNITE-6890:
-

[~avinogradov], please review my patch.

> Proper behavior on ExchangeWorker exits with error 
> ---
>
> Key: IGNITE-6890
> URL: https://issues.apache.org/jira/browse/IGNITE-6890
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7
> Fix For: 2.4
>
>
> The node should be stopped anyway; what we can provide is a user callback, 
> something like beforeNodeStop.





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-23 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264324#comment-16264324
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

As agreed with [~avinogradov] and [~vozerov], a new metric will be added: 
a window (of configurable size) of pairs, time (in ms) -> duration (in ms), 
for JVM pause events whose duration exceeds a threshold.
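A minimal sketch of how such a window could be kept, assuming a fixed-size ring buffer that overwrites the oldest entry once it is full. The class name PauseWindowSketch, its methods, and the choice of a ring buffer are hypothetical illustrations and not the code of the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of a bounded window of long JVM pause events. */
final class PauseWindowSketch {
    /** Event timestamps in ms, stored as a ring buffer. */
    private final long[] timestamps;

    /** Event durations in ms, stored at the same indexes as the timestamps. */
    private final long[] durations;

    /** Only pauses longer than this many ms are recorded. */
    private final long thresholdMs;

    /** Total number of recorded events, including overwritten ones. */
    private long cnt;

    PauseWindowSketch(int size, long thresholdMs) {
        timestamps = new long[size];
        durations = new long[size];
        this.thresholdMs = thresholdMs;
    }

    /** Records a pause if its duration exceeds the threshold. */
    synchronized void onPause(long tsMs, long durationMs) {
        if (durationMs <= thresholdMs)
            return;

        int idx = (int)(cnt % timestamps.length);

        timestamps[idx] = tsMs;
        durations[idx] = durationMs;

        cnt++;
    }

    /** Returns the recorded pairs (time -> duration), oldest first. */
    synchronized Map<Long, Long> snapshot() {
        Map<Long, Long> res = new LinkedHashMap<>();

        long n = Math.min(cnt, timestamps.length);

        for (long i = cnt - n; i < cnt; i++) {
            int idx = (int)(i % timestamps.length);

            res.put(timestamps[idx], durations[idx]);
        }

        return res;
    }
}
```

With a window of size 2 and a threshold of 100 ms, recording pauses of 50, 150, 200 and 300 ms at times 1000, 2000, 3000 and 4000 leaves only the last two qualifying events in the snapshot.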

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is Java-based application. If node experiences long GC pauses it may 
> negatively affect other nodes. We need to find a way to detect long GC pauses 
> within the process and trigger some actions in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you sleep 
> while sleeping. As all Java threads are blocked on safepoint, we cannot use 
> Java's thread to detect Java's GC. Native threads should be used instead.
> Proposed solution:
> 1) Thread 1 should periodically call dummy JNI method returning current time, 
> and set this time to shared variable;
> 2) Thread 2 should periodically check that variable. If it has not been 
> changed for some time - most likely we are in GC pause. Once certain 
> threshold is reached - trigger compensating action, whether this is a 
> warning, process kill, or anything else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if STW pause is in progress. Java method cannot 
> be empty, as JVM is smart enough and can deduce it to no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-22 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16262267#comment-16262267
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

[~avinogradov], please review my corrections.

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is a Java-based application. If a node experiences long GC pauses, it 
> may negatively affect other nodes. We need to find a way to detect long GC 
> pauses within the process and trigger some action in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you are 
> asleep while sleeping. As all Java threads are blocked at a safepoint, we 
> cannot use a Java thread to detect Java's GC. Native threads should be used 
> instead.
> Proposed solution:
> 1) Thread 1 should periodically call a dummy JNI method returning the current 
> time, and store this time in a shared variable;
> 2) Thread 2 should periodically check that variable. If it has not 
> changed for some time, we are most likely in a GC pause. Once a certain 
> threshold is reached, trigger a compensating action, whether this is a 
> warning, a process kill, or something else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if an STW pause is in progress. The Java method 
> cannot be empty, as the JVM is smart enough to reduce it to a no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-21 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260873#comment-16260873
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

[~avinogradov], please review the patch.

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Affects Versions: 2.3
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
> Fix For: 2.4
>
>
> Ignite is a Java-based application. If a node experiences long GC pauses, it 
> may negatively affect other nodes. We need to find a way to detect long GC 
> pauses within the process and trigger some action in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you are 
> asleep while sleeping. As all Java threads are blocked at a safepoint, we 
> cannot use a Java thread to detect Java's GC. Native threads should be used 
> instead.
> Proposed solution:
> 1) Thread 1 should periodically call a dummy JNI method returning the current 
> time, and store this time in a shared variable;
> 2) Thread 2 should periodically check that variable. If it has not 
> changed for some time, we are most likely in a GC pause. Once a certain 
> threshold is reached, trigger a compensating action, whether this is a 
> warning, a process kill, or something else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if an STW pause is in progress. The Java method 
> cannot be empty, as the JVM is smart enough to reduce it to a no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Assigned] (IGNITE-6890) Proper behavior on ExchangeWorker exits with error

2017-11-21 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-6890:
---

Assignee: Dmitriy Sorokin

> Proper behavior on ExchangeWorker exits with error 
> ---
>
> Key: IGNITE-6890
> URL: https://issues.apache.org/jira/browse/IGNITE-6890
> Project: Ignite
>  Issue Type: Improvement
>Reporter: Anton Vinogradov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7
> Fix For: 2.4
>
>
> Node should be stopped anyway; what we can provide is a user callback, 
> something like 'beforeNodeStop'.





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-21 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260656#comment-16260656
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

[~vozerov], [~avinogradov]
The implementation, and even the existence, of that bean may differ between 
JVM implementations. Also, pauses may in theory have causes other than GC STW 
pauses.

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
>
> Ignite is a Java-based application. If a node experiences long GC pauses, it 
> may negatively affect other nodes. We need to find a way to detect long GC 
> pauses within the process and trigger some action in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you are 
> asleep while sleeping. As all Java threads are blocked at a safepoint, we 
> cannot use a Java thread to detect Java's GC. Native threads should be used 
> instead.
> Proposed solution:
> 1) Thread 1 should periodically call a dummy JNI method returning the current 
> time, and store this time in a shared variable;
> 2) Thread 2 should periodically check that variable. If it has not 
> changed for some time, we are most likely in a GC pause. Once a certain 
> threshold is reached, trigger a compensating action, whether this is a 
> warning, a process kill, or something else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if an STW pause is in progress. The Java method 
> cannot be empty, as the JVM is smart enough to reduce it to a no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-21 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260600#comment-16260600
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

[~avinogradov] and I discussed the set of required metrics and decided that 
the metrics will be the total count and the total duration of pauses 
exceeding the threshold.

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
>
> Ignite is a Java-based application. If a node experiences long GC pauses, it 
> may negatively affect other nodes. We need to find a way to detect long GC 
> pauses within the process and trigger some action in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you are 
> asleep while sleeping. As all Java threads are blocked at a safepoint, we 
> cannot use a Java thread to detect Java's GC. Native threads should be used 
> instead.
> Proposed solution:
> 1) Thread 1 should periodically call a dummy JNI method returning the current 
> time, and store this time in a shared variable;
> 2) Thread 2 should periodically check that variable. If it has not 
> changed for some time, we are most likely in a GC pause. Once a certain 
> threshold is reached, trigger a compensating action, whether this is a 
> warning, a process kill, or something else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if an STW pause is in progress. The Java method 
> cannot be empty, as the JVM is smart enough to reduce it to a no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Comment Edited] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-16 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16255216#comment-16255216
 ] 

Dmitriy Sorokin edited comment on IGNITE-6171 at 11/16/17 12:22 PM:


[~vozerov], [~avinogradov]
I think that we don't need a JNI method; we only need a standard thread 
that wakes up after a small fixed timeout (20 ms, for example), updates 
the stored time value with the current system time, and calculates the 
difference from the previous value.
If that difference deviates significantly from the expected one, it means 
that our thread was frozen for some time, and it does not matter whether the 
cause was an STW pause or some other degradation of system responsiveness.
A state in which our control thread no longer runs at all cannot arise 
instantaneously, so we can detect degradation of system responsiveness this 
way.
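
A minimal sketch of such a control thread, under stated assumptions, might 
look like the following. The class name PauseDetectorSketch, the 500 ms 
threshold, and the warning text are hypothetical. The sketch also uses 
System.nanoTime() rather than the wall-clock system time mentioned above, so 
that clock adjustments are not mistaken for pauses.

```java
/** Hypothetical sketch of a plain Java thread that notices when it was frozen. */
final class PauseDetectorSketch {
    /** Wake-up period, taken from the "20 ms, for example" suggestion. */
    static final long PERIOD_MS = 20;

    /** Assumed threshold; the discussion does not fix a value. */
    static final long THRESHOLD_MS = 500;

    /** Extra delay in ms beyond the expected sleep period, never negative. */
    static long pauseMs(long prevNanos, long nowNanos, long periodMs) {
        long elapsedMs = (nowNanos - prevNanos) / 1_000_000L;

        return Math.max(0L, elapsedMs - periodMs);
    }

    /** A pause counts as long only if it strictly exceeds the threshold. */
    static boolean isLongPause(long pauseMs, long thresholdMs) {
        return pauseMs > thresholdMs;
    }

    public static void main(String[] args) throws InterruptedException {
        Thread detector = new Thread(() -> {
            long prev = System.nanoTime();

            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(PERIOD_MS);
                }
                catch (InterruptedException e) {
                    return; // Stop requested.
                }

                long now = System.nanoTime();
                long pause = pauseMs(prev, now, PERIOD_MS);

                // The cause may be an STW pause or any other freeze of this thread.
                if (isLongPause(pause, THRESHOLD_MS))
                    System.err.println("Possible long JVM pause: " + pause + " ms");

                prev = now;
            }
        }, "pause-detector-sketch");

        detector.setDaemon(true);
        detector.start();

        Thread.sleep(200); // Let the detector run briefly, then stop it.

        detector.interrupt();
        detector.join();
    }
}
```

The detector only observes that its own wake-up was late, so it cannot tell a 
GC pause from other causes of a frozen process, which matches the reasoning 
above.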


was (Author: cyberdemon):
I think that we don't need a JNI method; we only need a standard thread 
that wakes up after a small fixed timeout (20 ms, for example), updates 
the stored time value with the current system time, and calculates the 
difference from the previous value.
If that difference deviates significantly from the expected one, it means 
that our thread was frozen for some time, and it does not matter whether the 
cause was an STW pause or some other degradation of system responsiveness.
A state in which our control thread no longer runs at all cannot arise 
instantaneously, so we can detect degradation of system responsiveness this 
way.

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
>
> Ignite is a Java-based application. If a node experiences long GC pauses, it 
> may negatively affect other nodes. We need to find a way to detect long GC 
> pauses within the process and trigger some action in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you are 
> asleep while sleeping. As all Java threads are blocked at a safepoint, we 
> cannot use a Java thread to detect Java's GC. Native threads should be used 
> instead.
> Proposed solution:
> 1) Thread 1 should periodically call a dummy JNI method returning the current 
> time, and store this time in a shared variable;
> 2) Thread 2 should periodically check that variable. If it has not 
> changed for some time, we are most likely in a GC pause. Once a certain 
> threshold is reached, trigger a compensating action, whether this is a 
> warning, a process kill, or something else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if an STW pause is in progress. The Java method 
> cannot be empty, as the JVM is smart enough to reduce it to a no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-16 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16255216#comment-16255216
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

I think that we don't need a JNI method; we only need a standard thread 
that wakes up after a small fixed timeout (20 ms, for example), updates 
the stored time value with the current system time, and calculates the 
difference from the previous value.
If that difference deviates significantly from the expected one, it means 
that our thread was frozen for some time, and it does not matter whether the 
cause was an STW pause or some other degradation of system responsiveness.
A state in which our control thread no longer runs at all cannot arise 
instantaneously, so we can detect degradation of system responsiveness this 
way.

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
>
> Ignite is a Java-based application. If a node experiences long GC pauses, it 
> may negatively affect other nodes. We need to find a way to detect long GC 
> pauses within the process and trigger some action in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you are 
> asleep while sleeping. As all Java threads are blocked at a safepoint, we 
> cannot use a Java thread to detect Java's GC. Native threads should be used 
> instead.
> Proposed solution:
> 1) Thread 1 should periodically call a dummy JNI method returning the current 
> time, and store this time in a shared variable;
> 2) Thread 2 should periodically check that variable. If it has not 
> changed for some time, we are most likely in a GC pause. Once a certain 
> threshold is reached, trigger a compensating action, whether this is a 
> warning, a process kill, or something else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if an STW pause is in progress. The Java method 
> cannot be empty, as the JVM is smart enough to reduce it to a no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Commented] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-15 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16253223#comment-16253223
 ] 

Dmitriy Sorokin commented on IGNITE-6171:
-

[~vozerov], [~avinogradov] Hi! What length of STW pause should we consider 
very long?

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: iep-7, usability
>
> Ignite is a Java-based application. If a node experiences long GC pauses, it 
> may negatively affect other nodes. We need to find a way to detect long GC 
> pauses within the process and trigger some action in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you are 
> asleep while sleeping. As all Java threads are blocked at a safepoint, we 
> cannot use a Java thread to detect Java's GC. Native threads should be used 
> instead.
> Proposed solution:
> 1) Thread 1 should periodically call a dummy JNI method returning the current 
> time, and store this time in a shared variable;
> 2) Thread 2 should periodically check that variable. If it has not 
> changed for some time, we are most likely in a GC pause. Once a certain 
> threshold is reached, trigger a compensating action, whether this is a 
> warning, a process kill, or something else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if an STW pause is in progress. The Java method 
> cannot be empty, as the JVM is smart enough to reduce it to a no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Assigned] (IGNITE-6171) Native facility to control excessive GC pauses

2017-11-14 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-6171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-6171:
---

Assignee: Dmitriy Sorokin

> Native facility to control excessive GC pauses
> --
>
> Key: IGNITE-6171
> URL: https://issues.apache.org/jira/browse/IGNITE-6171
> Project: Ignite
>  Issue Type: Task
>  Components: general
>Reporter: Vladimir Ozerov
>Assignee: Dmitriy Sorokin
>  Labels: usability
>
> Ignite is a Java-based application. If a node experiences long GC pauses, it 
> may negatively affect other nodes. We need to find a way to detect long GC 
> pauses within the process and trigger some action in response, e.g. node stop. 
> This is a kind of Inception \[1\], when you need to understand that you are 
> asleep while sleeping. As all Java threads are blocked at a safepoint, we 
> cannot use a Java thread to detect Java's GC. Native threads should be used 
> instead.
> Proposed solution:
> 1) Thread 1 should periodically call a dummy JNI method returning the current 
> time, and store this time in a shared variable;
> 2) Thread 2 should periodically check that variable. If it has not 
> changed for some time, we are most likely in a GC pause. Once a certain 
> threshold is reached, trigger a compensating action, whether this is a 
> warning, a process kill, or something else.
> Justification: crossing native -> Java boundaries involves safepoints. This 
> way Thread 1 will be trapped if an STW pause is in progress. The Java method 
> cannot be empty, as the JVM is smart enough to reduce it to a no-op. 
> \[1\] http://www.imdb.com/title/tt1375666/





[jira] [Assigned] (IGNITE-5811) Detect internal Ignite problems (java-level deadlock, hangs, etc) and act according to a policy configured.

2017-11-13 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-5811:
---

Assignee: (was: Dmitriy Sorokin)

> Detect internal Ignite problems (java-level deadlock, hangs, etc) and act 
> according to a policy configured.
> ---
>
> Key: IGNITE-5811
> URL: https://issues.apache.org/jira/browse/IGNITE-5811
> Project: Ignite
>  Issue Type: New Feature
>Reporter: Yakov Zhdanov
>  Labels: usability
>
> This has something in common with segmentation policy we currently have. User 
> should get notified on a deadlock problem and node should take an action 
> (stop by default).
> Ignite may also react to internal errors and hangs in the same way: fire 
> an event and take the appropriate action.
> Current list of cases when node should (by default) stop itself:
> # Discovery reports segmentation (already implemented)
> # Critical discovery thread fails (already implemented)
> # NIO communication thread fails (already implemented)
> The following needs to be added
> # Java-deadlock detected
> # Internal threads stuck (no progress on current tasks during defined period)
> # ExchangeWorker exits with error
> We need to reapproach handling for all situations above to use the same 
> mechanism and make node take the action according to a configured policy





[jira] [Assigned] (IGNITE-5811) Detect internal Ignite problems (java-level deadlock, hangs, etc) and act according to a policy configured.

2017-11-13 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-5811?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-5811:
---

Assignee: Dmitriy Sorokin

> Detect internal Ignite problems (java-level deadlock, hangs, etc) and act 
> according to a policy configured.
> ---
>
> Key: IGNITE-5811
> URL: https://issues.apache.org/jira/browse/IGNITE-5811
> Project: Ignite
>  Issue Type: New Feature
>Reporter: Yakov Zhdanov
>Assignee: Dmitriy Sorokin
>  Labels: usability
>
> This has something in common with segmentation policy we currently have. User 
> should get notified on a deadlock problem and node should take an action 
> (stop by default).
> Ignite may also react to internal errors and hangs in the same way: fire 
> an event and take the appropriate action.
> Current list of cases when node should (by default) stop itself:
> # Discovery reports segmentation (already implemented)
> # Critical discovery thread fails (already implemented)
> # NIO communication thread fails (already implemented)
> The following needs to be added
> # Java-deadlock detected
> # Internal threads stuck (no progress on current tasks during defined period)
> # ExchangeWorker exits with error
> We need to reapproach handling for all situations above to use the same 
> mechanism and make node take the action according to a configured policy





[jira] [Assigned] (IGNITE-5691) IgniteHadoopFileSystemShmemExternalDualAsyncSelfTest sometimes hangs on TC

2017-10-03 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-5691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-5691:
---

Assignee: Dmitriy Sorokin

> IgniteHadoopFileSystemShmemExternalDualAsyncSelfTest sometimes hangs on TC
> --
>
> Key: IGNITE-5691
> URL: https://issues.apache.org/jira/browse/IGNITE-5691
> Project: Ignite
>  Issue Type: Bug
>  Components: hadoop
>Affects Versions: 2.1
>Reporter: Ilya Lantukh
>Assignee: Dmitriy Sorokin
>Priority: Critical
>  Labels: MakeTeamcityGreenAgain, Muted_test, test-fail
> Attachments: Ignite_2.0_Tests_Ignite_IGFS_Linux_and_MacOS_444.log.zip
>
>
> Hangs when output stream is closed:
> {noformat}
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] Thread 
> [name="test-runner-#15168%grid%", id=24808, state=WAITING, blockCnt=0, 
> waitCnt=3]
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> sun.misc.Unsafe.park(Native Method)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:315)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:176)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:139)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> o.a.i.i.processors.hadoop.impl.igfs.HadoopIgfsOutProc.closeStream(HadoopIgfsOutProc.java:446)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> o.a.i.i.processors.hadoop.impl.igfs.HadoopIgfsOutputStream.close(HadoopIgfsOutputStream.java:142)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> java.io.FilterOutputStream.close(FilterOutputStream.java:160)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> org.apache.hadoop.fs.FSDataOutputStream$PositionCache.close(FSDataOutputStream.java:70)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> org.apache.hadoop.fs.FSDataOutputStream.close(FSDataOutputStream.java:103)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> o.a.i.i.processors.hadoop.impl.igfs.IgniteHadoopFileSystemAbstractSelfTest.testDeleteSuccessfulIfPathIsOpenedToRead(IgniteHadoopFileSystemAbstractSelfTest.java:752)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> java.lang.reflect.Method.invoke(Method.java:606)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> junit.framework.TestCase.runTest(TestCase.java:176)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> o.a.i.testframework.junits.GridAbstractTest.runTestInternal(GridAbstractTest.java:1997)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> o.a.i.testframework.junits.GridAbstractTest.access$000(GridAbstractTest.java:132)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> o.a.i.testframework.junits.GridAbstractTest$5.run(GridAbstractTest.java:1912)
> [12:38:39]W:   [org.apache.ignite:ignite-hadoop] at 
> java.lang.Thread.run(Thread.java:745)
> {noformat}





[jira] [Commented] (IGNITE-5302) Empty LOST partition may be used as OWNING after resetting lost partitions

2017-10-01 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16187704#comment-16187704
 ] 

Dmitriy Sorokin commented on IGNITE-5302:
-

[~vozerov], I ran into a problem: there are no lost partitions after 
stopping two of the four nodes, and now I have some questions about this 
topic. I also think that we can move the fix to release 2.4 as well.

> Empty LOST partition may be used as OWNING after resetting lost partitions
> --
>
> Key: IGNITE-5302
> URL: https://issues.apache.org/jira/browse/IGNITE-5302
> Project: Ignite
>  Issue Type: Bug
>Reporter: Sergey Chugunov
>Assignee: Dmitriy Sorokin
>Priority: Blocker
>  Labels: MakeTeamcityGreenAgain, Muted_test, test-fail
> Fix For: 2.3
>
>
> h2. Notes
> Test *testPartitionLossAndRecover* reproducing the issue can be found in 
> ignite-5267 branch with PDS functionality.
> h2. Steps to reproduce
> # Four nodes are started, some key is added to partitioned cache
> # Primary and backup nodes for the key are stopped, key's partition is 
> declared LOST on remaining nodes
> # Primary and backup nodes are started again, cache's lost partitions are 
> reset
> # Key is requested from cache
> h2. Expected behavior
> Correct value is returned from primary for this partition
> h2. Actual behavior
> Request for value is sent to node where partition is empty (not to primary 
> node), null is returned
> h2. Latest findings
> # The main problem with the scenario is that request for key gets mapped not 
> only to P/B nodes with real value but also to the node where that partition 
> existed only in LOST state after P/B shutdown on step #2
> # It was found that on step #3 after primary and backup are joined partition 
> counter is increased for empty partition in LOST state which looks wrong





[jira] [Assigned] (IGNITE-5302) Empty LOST partition may be used as OWNING after resetting lost partitions

2017-09-15 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-5302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-5302:
---

Assignee: Dmitriy Sorokin

> Empty LOST partition may be used as OWNING after resetting lost partitions
> --
>
> Key: IGNITE-5302
> URL: https://issues.apache.org/jira/browse/IGNITE-5302
> Project: Ignite
>  Issue Type: Bug
>Reporter: Sergey Chugunov
>Assignee: Dmitriy Sorokin
>Priority: Blocker
>  Labels: MakeTeamcityGreenAgain, Muted_test, test-fail
> Fix For: 2.3
>
>
> h2. Notes
> Test *testPartitionLossAndRecover* reproducing the issue can be found in 
> ignite-5267 branch with PDS functionality.
> h2. Steps to reproduce
> # Four nodes are started, some key is added to partitioned cache
> # Primary and backup nodes for the key are stopped, key's partition is 
> declared LOST on remaining nodes
> # Primary and backup nodes are started again, cache's lost partitions are 
> reset
> # Key is requested from cache
> h2. Expected behavior
> Correct value is returned from primary for this partition
> h2. Actual behavior
> Request for value is sent to node where partition is empty (not to primary 
> node), null is returned
> h2. Latest findings
> # The main problem with the scenario is that request for key gets mapped not 
> only to P/B nodes with real value but also to the node where that partition 
> existed only in LOST state after P/B shutdown on step #2
> # It was found that on step #3 after primary and backup are joined partition 
> counter is increased for empty partition in LOST state which looks wrong





[jira] [Updated] (IGNITE-4181) The several runs of ServicesExample causes NPE

2017-08-04 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin updated IGNITE-4181:

Fix Version/s: (was: 2.2)
   2.1

> The several runs of ServicesExample causes NPE
> --
>
> Key: IGNITE-4181
> URL: https://issues.apache.org/jira/browse/IGNITE-4181
> Project: Ignite
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.6, 1.7, 2.0
> Environment: Windows 10, Oracle JDK 7
>Reporter: Sergey Kozlov
>Assignee: Dmitriy Sorokin
>  Labels: newbie
> Fix For: 2.1
>
>
> 0. Open example project in IDEA
> 1. Start 2-3 {{ExampleNodeStartup}}
> 2. Run {{ServicesExample}} several times.
> Sometimes it causes NullPointerException:
> {noformat}
> Executing closure [mapSize=10]
> Service was cancelled: myNodeSingletonService
> [15:37:20,020][INFO ][srvc-deploy-#24%null%][GridServiceProcessor] Cancelled 
> service instance [name=myNodeSingletonService, 
> execId=88a92a4d-c1cb-4a9b-8930-c67ac7f42bf3]
> [15:37:20,032][INFO ][sys-#33%null%][GridCacheProcessor] Stopped cache: 
> myNodeSingletonService
> [15:37:20,033][INFO 
> ][exchange-worker-#23%null%][GridCachePartitionExchangeManager] Skipping 
> rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=10, 
> minorTopVer=4], evt=DISCOVERY_CUSTOM_EVT, 
> node=5faac72a-72ab-4277-9643-0e962973b3f4]
> [15:37:20,045][INFO ][sys-#39%null%][GridCacheProcessor] Stopped cache: 
> myClusterSingletonService
> [15:37:20,046][INFO 
> ][exchange-worker-#23%null%][GridCachePartitionExchangeManager] Skipping 
> rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=10, 
> minorTopVer=5], evt=DISCOVERY_CUSTOM_EVT, 
> node=478f1752-fdce-42c6-aef6-55a5f4c08d90]
> [15:37:20,062][INFO ][disco-event-worker-#20%null%][GridDiscoveryManager] 
> Node left topology: TcpDiscoveryNode 
> [id=4f9cbc67-d756-4c25-9ee4-aee6528da024, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.25.4.107, 2001:0:9d38:6ab8:34b2:9f3e:3c6f:269], 
> sockAddrs=[/2001:0:9d38:6ab8:34b2:9f3e:3c6f:269:0, /127.0.0.1:0, 
> /0:0:0:0:0:0:0:1:0, work-pc/172.25.4.107:0], discPort=0, order=10, 
> intOrder=7, lastExchangeTime=1478522239236, loc=false, 
> ver=1.7.3#20161107-sha1:5132ac87, isClient=true]
> [15:37:20,063][INFO ][disco-event-worker-#20%null%][GridDiscoveryManager] 
> Topology snapshot [ver=11, servers=3, clients=0, CPUs=8, heap=11.0GB]
> [15:37:20,064][INFO ][sys-#44%null%][GridCacheProcessor] Stopped cache: 
> myMultiService
> [15:37:20,066][INFO 
> ][exchange-worker-#23%null%][GridCachePartitionExchangeManager] Skipping 
> rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=10, 
> minorTopVer=6], evt=DISCOVERY_CUSTOM_EVT, 
> node=5faac72a-72ab-4277-9643-0e962973b3f4]
> [15:37:20,076][INFO ][exchange-worker-#23%null%][GridCacheProcessor] Started 
> cache [name=myClusterSingletonService, mode=PARTITIONED]
> [15:37:20,115][INFO 
> ][exchange-worker-#23%null%][GridCachePartitionExchangeManager] Skipping 
> rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=10, 
> minorTopVer=7], evt=DISCOVERY_CUSTOM_EVT, 
> node=478f1752-fdce-42c6-aef6-55a5f4c08d90]
> [15:37:20,121][INFO 
> ][exchange-worker-#23%null%][GridCachePartitionExchangeManager] Skipping 
> rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=11, 
> minorTopVer=0], evt=NODE_LEFT, node=4f9cbc67-d756-4c25-9ee4-aee6528da024]
> [15:37:20,133][INFO ][exchange-worker-#23%null%][GridCacheProcessor] Started 
> cache [name=myMultiService, mode=PARTITIONED]
> [15:37:20,135][ERROR][exchange-worker-#23%null%][GridDhtPartitionsExchangeFuture]
>  Failed to reinitialize local partitions (preloading will be stopped): 
> GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=11, 
> minorTopVer=1], nodeId=5faac72a, evt=DISCOVERY_CUSTOM_EVT]
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.initStartedCacheOnCoordinator(CacheAffinitySharedManager.java:743)
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onCacheChangeRequest(CacheAffinitySharedManager.java:413)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:565)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:448)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:1447)
>   at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>   at java.lang.Thread.run(Thread.java:745)
> [15:37:2

[jira] [Assigned] (IGNITE-4181) The several runs of ServicesExample causes NPE

2017-07-25 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin reassigned IGNITE-4181:
---

Assignee: Dmitriy Sorokin  (was: Andrey Kuznetsov)

> The several runs of ServicesExample causes NPE
> --
>
> Key: IGNITE-4181
> URL: https://issues.apache.org/jira/browse/IGNITE-4181
> Project: Ignite
>  Issue Type: Bug
>  Components: general
>Affects Versions: 1.6, 1.7, 2.0
> Environment: Windows 10, Oracle JDK 7
>Reporter: Sergey Kozlov
>Assignee: Dmitriy Sorokin
>  Labels: newbie
> Fix For: 2.2
>
>
> 0. Open example project in IDEA
> 1. Start 2-3 {{ExampleNodeStartup}}
> 2. Run {{ServicesExample}} several times.
> Sometimes it causes NullPointerException:
> {noformat}
> Executing closure [mapSize=10]
> Service was cancelled: myNodeSingletonService
> [15:37:20,020][INFO ][srvc-deploy-#24%null%][GridServiceProcessor] Cancelled 
> service instance [name=myNodeSingletonService, 
> execId=88a92a4d-c1cb-4a9b-8930-c67ac7f42bf3]
> [15:37:20,032][INFO ][sys-#33%null%][GridCacheProcessor] Stopped cache: 
> myNodeSingletonService
> [15:37:20,033][INFO 
> ][exchange-worker-#23%null%][GridCachePartitionExchangeManager] Skipping 
> rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=10, 
> minorTopVer=4], evt=DISCOVERY_CUSTOM_EVT, 
> node=5faac72a-72ab-4277-9643-0e962973b3f4]
> [15:37:20,045][INFO ][sys-#39%null%][GridCacheProcessor] Stopped cache: 
> myClusterSingletonService
> [15:37:20,046][INFO 
> ][exchange-worker-#23%null%][GridCachePartitionExchangeManager] Skipping 
> rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=10, 
> minorTopVer=5], evt=DISCOVERY_CUSTOM_EVT, 
> node=478f1752-fdce-42c6-aef6-55a5f4c08d90]
> [15:37:20,062][INFO ][disco-event-worker-#20%null%][GridDiscoveryManager] 
> Node left topology: TcpDiscoveryNode 
> [id=4f9cbc67-d756-4c25-9ee4-aee6528da024, addrs=[0:0:0:0:0:0:0:1, 127.0.0.1, 
> 172.25.4.107, 2001:0:9d38:6ab8:34b2:9f3e:3c6f:269], 
> sockAddrs=[/2001:0:9d38:6ab8:34b2:9f3e:3c6f:269:0, /127.0.0.1:0, 
> /0:0:0:0:0:0:0:1:0, work-pc/172.25.4.107:0], discPort=0, order=10, 
> intOrder=7, lastExchangeTime=1478522239236, loc=false, 
> ver=1.7.3#20161107-sha1:5132ac87, isClient=true]
> [15:37:20,063][INFO ][disco-event-worker-#20%null%][GridDiscoveryManager] 
> Topology snapshot [ver=11, servers=3, clients=0, CPUs=8, heap=11.0GB]
> [15:37:20,064][INFO ][sys-#44%null%][GridCacheProcessor] Stopped cache: 
> myMultiService
> [15:37:20,066][INFO 
> ][exchange-worker-#23%null%][GridCachePartitionExchangeManager] Skipping 
> rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=10, 
> minorTopVer=6], evt=DISCOVERY_CUSTOM_EVT, 
> node=5faac72a-72ab-4277-9643-0e962973b3f4]
> [15:37:20,076][INFO ][exchange-worker-#23%null%][GridCacheProcessor] Started 
> cache [name=myClusterSingletonService, mode=PARTITIONED]
> [15:37:20,115][INFO 
> ][exchange-worker-#23%null%][GridCachePartitionExchangeManager] Skipping 
> rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=10, 
> minorTopVer=7], evt=DISCOVERY_CUSTOM_EVT, 
> node=478f1752-fdce-42c6-aef6-55a5f4c08d90]
> [15:37:20,121][INFO 
> ][exchange-worker-#23%null%][GridCachePartitionExchangeManager] Skipping 
> rebalancing (nothing scheduled) [top=AffinityTopologyVersion [topVer=11, 
> minorTopVer=0], evt=NODE_LEFT, node=4f9cbc67-d756-4c25-9ee4-aee6528da024]
> [15:37:20,133][INFO ][exchange-worker-#23%null%][GridCacheProcessor] Started 
> cache [name=myMultiService, mode=PARTITIONED]
> [15:37:20,135][ERROR][exchange-worker-#23%null%][GridDhtPartitionsExchangeFuture]
>  Failed to reinitialize local partitions (preloading will be stopped): 
> GridDhtPartitionExchangeId [topVer=AffinityTopologyVersion [topVer=11, 
> minorTopVer=1], nodeId=5faac72a, evt=DISCOVERY_CUSTOM_EVT]
> java.lang.NullPointerException
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.initStartedCacheOnCoordinator(CacheAffinitySharedManager.java:743)
>   at 
> org.apache.ignite.internal.processors.cache.CacheAffinitySharedManager.onCacheChangeRequest(CacheAffinitySharedManager.java:413)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.onCacheChangeRequest(GridDhtPartitionsExchangeFuture.java:565)
>   at 
> org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture.init(GridDhtPartitionsExchangeFuture.java:448)
>   at 
> org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$ExchangeWorker.body(GridCachePartitionExchangeManager.java:1447)
>   at 
> org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
>   at java.lang.Thread.run(Thread.java:745)
> [15

[jira] [Commented] (IGNITE-4181) The several runs of ServicesExample causes NPE

2017-07-24 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16098501#comment-16098501
 ] 

Dmitriy Sorokin commented on IGNITE-4181:
-

This issue seems to be resolved by:

commit 7e45010b4848d0a570995e6dc938875710d846d8
Author: sboikov 
Date: 2017-06-04T08:02:31Z

ignite-5075 'logical' caches sharing the same 'physical' cache group


[jira] [Commented] (IGNITE-4181) The several runs of ServicesExample causes NPE

2017-07-24 Thread Dmitriy Sorokin (JIRA)

[ 
https://issues.apache.org/jira/browse/IGNITE-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16098485#comment-16098485
 ] 

Dmitriy Sorokin commented on IGNITE-4181:
-

This issue was caused by the way GridCacheProcessor's cache descriptors map is 
used concurrently: by event handlers on the TCP discovery workers and by 
exchange processing on the exchange worker. A cache creation event puts the 
newly created cache descriptor into GCP's cache descriptors map and then emits 
an exchange. Processing of that exchange tries to take the cache descriptor, 
which may already have been removed by a cache deletion event that was 
processed later than the cache creation event but earlier than the cache 
creation exchange.
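
The interleaving described in this comment can be illustrated with a minimal, 
self-contained sketch. It is not Ignite code and not the actual fix: the class 
DescriptorRaceSketch, the Descriptor type and all method names are hypothetical 
stand-ins, and the events are replayed sequentially on one thread so that the 
outcome is deterministic.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical illustration of the race; none of these names exist in Ignite. */
final class DescriptorRaceSketch {
    /** Stand-in for a cache descriptor (assumption, not the Ignite class). */
    static final class Descriptor {
        final String cacheName;

        Descriptor(String cacheName) {
            this.cacheName = cacheName;
        }
    }

    /** Shared registry playing the role of the cache descriptors map. */
    static final ConcurrentMap<String, Descriptor> descriptors = new ConcurrentHashMap<>();

    /** Discovery side: a cache creation event registers the descriptor. */
    static void onCacheStart(String name) {
        descriptors.put(name, new Descriptor(name));
    }

    /** Discovery side: a cache deletion event removes the descriptor. */
    static void onCacheStop(String name) {
        descriptors.remove(name);
    }

    /** Exchange side, unguarded: dereferences whatever the map returns. */
    static String initUnguarded(String name) {
        return descriptors.get(name).cacheName; // NPE if the descriptor is gone
    }

    /** Exchange side, guarded: a missing descriptor means the cache was stopped. */
    static String initGuarded(String name) {
        Descriptor desc = descriptors.get(name);
        return desc == null ? null : desc.cacheName;
    }

    public static void main(String[] args) {
        onCacheStart("myMultiService"); // creation event handled
        onCacheStop("myMultiService");  // deletion event handled before the creation exchange

        try {
            initUnguarded("myMultiService"); // the creation exchange finally runs
        } catch (NullPointerException e) {
            System.out.println("unguarded lookup failed with NPE");
        }

        System.out.println("guarded lookup returned " + initGuarded("myMultiService"));
    }
}
```

The guarded variant only shows one generic way to tolerate a descriptor that 
has already been removed; whether a null check is appropriate for the real 
exchange code is an assumption of this sketch.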


[jira] [Updated] (IGNITE-4181) The several runs of ServicesExample causes NPE

2017-07-24 Thread Dmitriy Sorokin (JIRA)

 [ 
https://issues.apache.org/jira/browse/IGNITE-4181?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitriy Sorokin updated IGNITE-4181:

Affects Version/s: 2.0

