Re: Cluster hung after a node killed

2017-01-04 Thread vkulichenko
Sam,

There is no exact date as of now. I would recommend monitoring the dev list
for activity around this.

-Val





Re: Cluster hung after a node killed

2017-01-04 Thread javastuff....@gmail.com
Thanks Val. The ticket has been fixed for 1.9. Any idea when 1.9 will be
available?

-Sam





Re: Cluster hung after a node killed

2016-12-16 Thread vkulichenko
Hi Sam,

I reproduced the issue using your code and created a ticket:
https://issues.apache.org/jira/browse/IGNITE-4450

-Val





Re: Cluster hung after a node killed

2016-12-09 Thread javastuff....@gmail.com
Hi Val,

Did you have a chance to look at the attached sample program? Did it help
figure out what is going on?

-Sam





Re: Cluster hung after a node killed

2016-12-02 Thread javastuff....@gmail.com
Attaching 4 classes -
Node1.java - creates an empty cache node.
Node2.java - seeds the cache with 100K <Integer, String> entries.
Node3.java - takes an explicit lock on a key and waits 15 seconds before unlocking.
Node4.java - fetches cached data.

Steps -
1. Run Node1, wait for the empty node to boot up.
2. Run Node2, wait for the completion message.
3. Run Node3, kill it when it prompts "kill me...".
4. Run Node4. The topology snapshot on the other nodes will show the node
joined, however Node4 will not be able to complete any fetch operation.
Fetches from the cache hang.
5. Run another instance of Node2. It will not be able to complete any put
operation. Puts to the cache hang.

Node1.java
<http://apache-ignite-users.70518.x6.nabble.com/file/n9375/Node1.java>  
Node2.java
<http://apache-ignite-users.70518.x6.nabble.com/file/n9375/Node2.java>  
Node3.java
<http://apache-ignite-users.70518.x6.nabble.com/file/n9375/Node3.java>  
Node4.java
<http://apache-ignite-users.70518.x6.nabble.com/file/n9375/Node4.java>  
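
For reference, a rough sketch of what Node3 does, in case the attachments are
hard to open (the actual attached class may differ; the cache name
"TEST_CACHE" and the key are placeholders):

import java.util.concurrent.locks.Lock;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class Node3Sketch {
    public static void main(String[] args) throws Exception {
        // Join the running cluster started by Node1/Node2.
        Ignite ignite = Ignition.start();

        // Placeholder cache name; the real class uses whatever Node2 created.
        IgniteCache<Integer, String> cache = ignite.cache("TEST_CACHE");

        // Take an explicit lock on one key.
        Lock lock = cache.lock(1);
        lock.lock();

        System.out.println("kill me...");

        // Window in which the process is killed with kill -9 while the lock is held.
        Thread.sleep(15_000);

        // Never reached if the JVM is killed above.
        lock.unlock();
    }
}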





Re: Cluster hung after a node killed

2016-12-01 Thread vkulichenko
Hi Sam,

Can you create a small project on GitHub with instructions on how to run it? I
just tried to repeat the same test, but everything works fine for me.

-Val





Re: Cluster hung after a node killed

2016-12-01 Thread javastuff....@gmail.com
Val,

I have reproduced this with a simple program -
1. Node 1 - run the example ExampleNodeStartup.
2. Node 2 - run a program which creates a transactional cache and adds 100K
simple <String, String> entries (a fuller sketch follows below):
cfg.setCacheMode(CacheMode.PARTITIONED); 
cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL); 
cfg.setMemoryMode(CacheMemoryMode.OFFHEAP_TIERED); 
cfg.setSwapEnabled(false); 
cfg.setBackups(0); 
3. Node 3 - run a program which takes a lock (cache.lock(key)).
4. Kill Node 3 before it can unlock.
5. Node 4 - run a program which tries to get cached data.

Node4 is not able to join the cluster; it hangs. In fact, the complete cluster
is hung and any operation from Node2 also hangs.
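
For completeness, a rough sketch of the Node 2 program around that
configuration (the cache name "TEST_CACHE" and the key/value format are
placeholders, not the exact code I ran):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheAtomicityMode;
import org.apache.ignite.cache.CacheMemoryMode;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

public class Node2Sketch {
    public static void main(String[] args) {
        // Join the cluster started by the ExampleNodeStartup node.
        Ignite ignite = Ignition.start();

        // Placeholder cache name; Ignite 1.x off-heap cache settings.
        CacheConfiguration<String, String> cfg = new CacheConfiguration<>("TEST_CACHE");
        cfg.setCacheMode(CacheMode.PARTITIONED);
        cfg.setAtomicityMode(CacheAtomicityMode.TRANSACTIONAL);
        cfg.setMemoryMode(CacheMemoryMode.OFFHEAP_TIERED);
        cfg.setSwapEnabled(false);
        cfg.setBackups(0);

        IgniteCache<String, String> cache = ignite.getOrCreateCache(cfg);

        // Seed 100K simple entries.
        for (int i = 0; i < 100_000; i++)
            cache.put("key-" + i, "value-" + i);

        System.out.println("Seeding complete.");
    }
}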

-Sam





Re: Cluster hung after a node killed

2016-11-30 Thread vkulichenko
Sam,

It depends on the reason for this hang. If there is a bug, then most likely a
timeout will not help either :) Do you have a test that reproduces the
issue?

-Val





Re: Cluster hung after a node killed

2016-11-30 Thread javastuff....@gmail.com
Hi Val,

Killing the node that acquired the lock did not release it automatically and
leaves the whole cluster in a hung state; any operation on any cache (not
related to the lock) is stuck in a wait state. The cluster is not able to
recover seamlessly. Looks like a bug to me.

I understand a lock timeout can be error-prone, but if configured correctly it
can provide a second path of automatic recovery in such failover cases. Is
there any way to configure timeouts on a lock?
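
The only application-side workaround I can think of is bounding the wait
myself, since the Lock returned by cache.lock() is a standard
java.util.concurrent.locks.Lock. A rough sketch (the key, timeout and cache
types are illustrative) - though this only protects the caller and does not
release a lock still held by a crashed node:

import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Lock;

import org.apache.ignite.IgniteCache;

public class BoundedLockWaitSketch {
    // Update a key without blocking forever if the lock is stuck on a dead node.
    static void updateWithBoundedWait(IgniteCache<String, String> cache, String key)
        throws InterruptedException {
        Lock lock = cache.lock(key);

        // Bounded wait instead of lock(): give up after 10 seconds.
        if (lock.tryLock(10, TimeUnit.SECONDS)) {
            try {
                cache.put(key, "updated");
            }
            finally {
                // Always release, even if the put fails.
                lock.unlock();
            }
        }
        else {
            // Lock not acquired in time; retry or report instead of hanging.
        }
    }
}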

Explicit locks are one of the use cases we have, but cluster auto-recovery on
any change to the cluster is the most important, so if these two do not go
together then it is a show stopper.

-Sam







Re: Cluster hung after a node killed

2016-11-30 Thread thammoud
Hi Val,

I am curious how the Ignite cluster will behave with a killed node that is in
an active transaction holding implicit cache locks. Thank you. BTW, when
will 1.8 be released?
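
To make the scenario concrete, I mean something roughly like the following
(sketch only; the cache name and key are illustrative), where the node being
killed holds a key lock implicitly through a pessimistic transaction rather
than through cache.lock():

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;
import org.apache.ignite.transactions.Transaction;
import org.apache.ignite.transactions.TransactionConcurrency;
import org.apache.ignite.transactions.TransactionIsolation;

public class TxHolderSketch {
    public static void main(String[] args) throws Exception {
        Ignite ignite = Ignition.start();

        // Placeholder cache name; must be a TRANSACTIONAL cache.
        IgniteCache<String, String> cache = ignite.cache("TEST_CACHE");

        try (Transaction tx = ignite.transactions().txStart(
                TransactionConcurrency.PESSIMISTIC,
                TransactionIsolation.REPEATABLE_READ)) {
            // A pessimistic put acquires an implicit lock on the key.
            cache.put("key", "value");

            // The node is killed here, before commit, while the implicit lock is held.
            Thread.sleep(60_000);

            tx.commit();
        }
    }
}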





Re: Cluster hung after a node killed

2016-11-28 Thread vkulichenko
Hi Sam,

If there is an explicit lock that you acquired and did not release properly,
there is not much we can do. Explicit locks must be released explicitly;
doing this by timeout is even more error-prone in my view. BTW, just stopping
the node that acquired the lock should have helped as well, because in that
case locks are released automatically.

-Val





Re: Cluster hung after a node killed

2016-11-28 Thread javastuff....@gmail.com
Ideally the cluster should recover seamlessly. Is there any lock timeout which
I can configure, or any other configuration which will make sure the locks
taken by a crashing node get released and the cluster still serves all
requests?

Is this a bug?

-Sam





Re: Cluster hung after a node killed

2016-11-16 Thread javastuff....@gmail.com
Could you please elaborate on your suspicion?

The addRoleDelegationToCache and addDocument calls were made after killing
node3. These calls try to push data into the cache, and we are not using any
transaction API to explicitly start or commit a transaction on the cache while
pushing data. These calls are made by node1 when the regular application is
accessed. The application was not accessed immediately after killing node3; I
tried to access it after about 3-5 minutes. Killing node3 was done when the
system was idle.

Unfortunately I cannot share the application. I can try to reproduce it again
and provide more logs if needed, or try to write a test program to simulate
it.

Let me know if you need more logs, probably DEBUG level logs.

Below are some pointers from the thread dump and logs -

1. The following frame from the thread dump made me assume that the topology
change did not complete and any later cache operation is waiting on it:
org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.awaitTopologyVersion(GridAffinityAssignmentCache.java:523)

2. Node3's address is 10.107.186.137; 17:03:48,223 is the time when the
Server2 log first detected the failed node. Logs below -

17:03:48,223 WARNING [org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi]
(tcp-disco-msg-worker-#2%TESTNODE%) Local node has detected failed nodes and
started cluster-wide procedure. To speed up failure detection please see
'Failure Detection' section under javadoc for 'TcpDiscoverySpi'
17:03:48,237 WARNING
[org.apache.ignite.internal.managers.discovery.GridDiscoveryManager]
(disco-event-worker-#254%TESTNODE%) Node FAILED: TcpDiscoveryNode
[id=e840a775-36b9-48d3-993c-25dea95d59d0, addrs=[10.107.186.137, 10.245.1.1,
127.0.0.1, 192.168.122.1], sockAddrs=[/192.168.122.1:48500,
/10.107.186.137:48500, /10.245.1.1:48500, /127.0.0.1:48500], discPort=48500,
order=55, intOrder=29, lastExchangeTime=1478911399482, loc=false,
ver=1.7.0#20160801-sha1:383273e3, isClient=false]
17:03:48,240 INFO  [stdout] (disco-event-worker-#254%TESTNODE%) [17:03:48]
Topology snapshot [ver=56, servers=2, clients=0, CPUs=96, heap=4.0GB]
17:03:48,241 INFO 
[org.apache.ignite.internal.managers.discovery.GridDiscoveryManager]
(disco-event-worker-#254%TESTNODE%) Topology snapshot [ver=56, servers=2,
clients=0, CPUs=96, heap=4.0GB]
17:03:58,480 WARNING
[org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager]
(exchange-worker-#256%TESTNODE%) Failed to wait for partition map exchange
[topVer=AffinityTopologyVersion [topVer=56, minorTopVer=0],
node=8cc0ac24-24b9-4d69-8472-b6a567f4d907]. Dumping pending objects that
might be the cause: 
17:03:58,482 WARNING
[org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager]
(exchange-worker-#256%TESTNODE%) Ready affinity version:
AffinityTopologyVersion [topVer=55, minorTopVer=1]








Re: Cluster hung after a node killed

2016-11-15 Thread javastuff....@gmail.com
logs.zip <http://apache-ignite-users.70518.x6.nabble.com/file/n9010/logs.zip>  

Attaching the thread dump and log from 2 nodes. Lost the logs from node 3,
which was killed using "kill -9".

Let me know if you need more logs.

In my understanding, after killing node 3 the topology version update got
screwed up and Node2 keeps complaining about the failed Node3. Node 1 tried to
access the application, which hung on an Ignite get or put call because of a
topology mismatch or a lock.

-Sam 






Re: Cluster hung after a node killed

2016-11-15 Thread vdpyatkov
Hi,

I think the INFO log level would be enough (from all nodes).





Re: Cluster hung after a node killed

2016-11-14 Thread javastuff....@gmail.com
Do you want Ignite to be running in DEBUG, or is System.out output enough
from all 3 nodes?





Re: Cluster hung after a node killed

2016-11-14 Thread vkulichenko
Hi Sam,

Please attach full logs and full thread dumps if you want someone to take a
look. There is not enough information in your message to understand the
reason for the issue.

-Val





Cluster hung after a node killed

2016-11-14 Thread javastuff....@gmail.com
Hi,

I have configured the cache as an off-heap partitioned cache and am running 3
nodes on separate machines. I loaded some data into the cache using my
application's normal operations.

Used "kill -9" to kill node 3.

Node 2 shows the warning below on the console every 10 seconds -

11:03:03,320 WARNING
[org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager]
(exchange-worker-#256%TESTNODE%) Failed to wait for partition map exchange
[topVer=AffinityTopologyVersion [topVer=3, minorTopVer=0],
node=8cc0ac24-24b9-4d69-8472-b6a567f4d907]. Dumping pending objects that
might be the cause:

Node 1 looks fine. However, the application does not work anymore and the
thread dump shows it is waiting on a cache put -

java.lang.Thread.State: WAITING (parking)
at sun.misc.Unsafe.park(Native Method)
- parking to wait for  <0x0007ecbd4a38> (a org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache$AffinityReadyFuture)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:186)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:834)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:994)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1303)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:159)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:117)
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.awaitTopologyVersion(GridAffinityAssignmentCache.java:523)
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.cachedAffinity(GridAffinityAssignmentCache.java:434)
at org.apache.ignite.internal.processors.affinity.GridAffinityAssignmentCache.nodes(GridAffinityAssignmentCache.java:387)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.nodes(GridCacheAffinityManager.java:259)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primary(GridCacheAffinityManager.java:295)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primary(GridCacheAffinityManager.java:286)
at org.apache.ignite.internal.processors.cache.GridCacheAffinityManager.primary(GridCacheAffinityManager.java:310)
at org.apache.ignite.internal.processors.cache.distributed.dht.colocated.GridDhtColocatedCache.entryExx(GridDhtColocatedCache.java:176)
at org.apache.ignite.internal.processors.cache.distributed.near.GridNearTxLocal.entryEx(GridNearTxLocal.java:1251)
at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.enlistWriteEntry(IgniteTxLocalAdapter.java:2354)
at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.enlistWrite(IgniteTxLocalAdapter.java:1990)
at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.putAsync0(IgniteTxLocalAdapter.java:2902)
at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.putAsync(IgniteTxLocalAdapter.java:1859)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter$22.op(GridCacheAdapter.java:2240)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter$22.op(GridCacheAdapter.java:2238)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.syncOp(GridCacheAdapter.java:4351)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2238)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.put(GridCacheAdapter.java:2215)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxy.put(IgniteCacheProxy.java:1214)


Is there any specific configuration I need to provide for self-recovery of the
cluster? Losing cache data is fine; the data is backed up in a persistent
store, for example a database.
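
For illustration, the only related knob I have found so far is the discovery
failure detection timeout. A minimal sketch of setting it (the value is purely
illustrative, and as far as I understand it only makes a dead node get dropped
from the topology faster; it is not by itself a recovery guarantee):

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class NodeStartupSketch {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Lower the failure detection timeout (default is 10 seconds) so the
        // killed node is detected and removed from the topology sooner.
        cfg.setFailureDetectionTimeout(3_000);

        Ignition.start(cfg);
    }
}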

-Sam


