Re: Ignite node crashed

2021-05-13 Thread Zhenya Stanilovsky

Hi, Marcus! It seems the problem really is due to a long GC pause.
Have you applied all of the suggestions from [1] and [2]?
[1]  https://apacheignite.readme.io/docs/jvm-and-system-tuning
[2]  https://apacheignite.readme.io/docs/jvm-and-system-tuning#memory-issues
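When the GC log shows nothing at the time of a stall, you can measure stalls independently of the collector. Below is a minimal sketch of a sleep-drift detector, a generic technique and not Ignite code. A thread sleeps for a fixed interval and reports whenever it wakes up much later than expected. This catches GC pauses, safepoints and OS-level stalls (swapping, hypervisor CPU steal) alike. The class name, the 50 ms interval and the 500 ms threshold are illustrative assumptions, not Ignite settings.

```java
import java.util.ArrayList;
import java.util.List;

final class PauseDetector {
    /** One detected stall: the wake-up time and how many ms longer than expected the sleep took. */
    static final class Pause {
        final long wakeMillis;
        final long excessMillis;

        Pause(long wakeMillis, long excessMillis) {
            this.wakeMillis = wakeMillis;
            this.excessMillis = excessMillis;
        }

        @Override
        public String toString() {
            return "Pause[wake=" + wakeMillis + "ms, excess=" + excessMillis + "ms]";
        }
    }

    /**
     * Offline check: given successive wake-up timestamps of a thread that sleeps
     * intervalMillis between samples, return every gap that overshoots by more than thresholdMillis.
     */
    static List<Pause> findPauses(long[] wakeMillis, long intervalMillis, long thresholdMillis) {
        List<Pause> pauses = new ArrayList<>();
        for (int i = 1; i < wakeMillis.length; i++) {
            long excess = wakeMillis[i] - wakeMillis[i - 1] - intervalMillis;
            if (excess > thresholdMillis) {
                pauses.add(new Pause(wakeMillis[i], excess));
            }
        }
        return pauses;
    }

    /** Live check: a daemon thread that samples the monotonic clock until interrupted. */
    static Thread startMonitor(long intervalMillis, long thresholdMillis) {
        Thread t = new Thread(() -> {
            long prev = System.nanoTime() / 1_000_000;
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    Thread.sleep(intervalMillis);
                } catch (InterruptedException e) {
                    return; // stop quietly when interrupted
                }
                long now = System.nanoTime() / 1_000_000;
                long excess = now - prev - intervalMillis;
                if (excess > thresholdMillis) {
                    System.err.println("Possible JVM/OS stall: woke up " + excess + " ms late");
                }
                prev = now;
            }
        }, "pause-detector");
        t.setDaemon(true);
        t.start();
        return t;
    }

    public static void main(String[] args) {
        // Synthetic timestamps: a 50 ms sampler that stalled for about 2 s before the 4th sample.
        long[] wakes = {0, 50, 100, 2150, 2200};
        System.out.println(findPauses(wakes, 50, 500)); // prints [Pause[wake=2150ms, excess=2000ms]]
    }
}
```

Run startMonitor(50, 500) in the same JVM as the node and compare its reports with the GC log timestamps. A reported stall with no matching GC entry points toward the OS or hypervisor rather than the collector.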
 
>Hi,
> 
>I have a 5-node Ignite cluster, and it seems that when I start to create
>tables in the cluster, one of the nodes crashes. All of the nodes are VMs
>with 8 CPUs and 128GB of memory. I have attached the log file, the GC log and
>the XML config for the crashing node (default data region of 90GB,
>heap size of 10GB). I can see the node having a long GC pause starting at
>04:29:58, but unfortunately the GC log doesn't show anything at that time. Can
>you please shed some light on the issue? Thanks.
> 
>Regards,
>Marcus

Re: Ignite Node crashed in middle of checkpoint and data loss

2019-02-20 Thread Ilya Kasnacheev
Hello!

Can you share your data files (WAL and db) so that we can try to
reproduce the crash?

If that is not feasible, my recommendation is to try bringing this data up
with a nightly build instead of 2.7:
https://ignite.apache.org/download.cgi#nightly-builds
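Because the WAL and db files are requested, here is a minimal sketch for packing several directories into one zip archive using only the JDK. Run it while the node is stopped so the files do not change underneath it. The class name is hypothetical. The usage comment uses the two paths that appear in the configuration quoted below. A real node may have other directories that should be included too, such as the active WAL directory, which is configured separately from the WAL archive.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

final class ZipDirs {
    /**
     * Writes every regular file under each root into zipFile. Entry names are
     * "<root dir name>/<relative path>", so the roots must have distinct names.
     * Returns the number of files written.
     */
    static int zipDirectories(Path zipFile, List<Path> roots) throws IOException {
        int count = 0;
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile))) {
            for (Path root : roots) {
                List<Path> files;
                try (Stream<Path> walk = Files.walk(root)) {
                    files = walk.filter(Files::isRegularFile).sorted().collect(Collectors.toList());
                }
                for (Path file : files) {
                    // Zip entries always use '/' as the separator, whatever the OS.
                    String rel = root.relativize(file).toString().replace('\\', '/');
                    zos.putNextEntry(new ZipEntry(root.getFileName() + "/" + rel));
                    Files.copy(file, zos); // streams the file into the current entry
                    zos.closeEntry();
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ZipDirs out.zip /data1/data/datastore /data2/data/wal/archive
        if (args.length < 2) {
            System.err.println("usage: ZipDirs <out.zip> <dir> [<dir> ...]");
            return;
        }
        List<Path> roots = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            roots.add(Paths.get(args[i]));
        }
        int n = zipDirectories(Paths.get(args[0]), roots);
        System.out.println("Archived " + n + " files into " + args[0]);
    }
}
```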

Regards,
-- 
Ilya Kasnacheev


On Tue, 19 Feb 2019 at 09:03, garima.j wrote:

> Hello,
>
> We have an Ignite cluster of 3 nodes (16GB RAM, 50GB disk space each) and
> have given 10GB (off-heap) to the data region and a 2GB (Xms) / 3GB (Xmx)
> heap to the nodes.
>
> One node went down, and while restarting it I get an exception saying that
> the Ignite node crashed in the middle of a checkpoint, and the JVM crashes after that.
>
> Ignite configuration:
>  class="org.apache.ignite.configuration.TransactionConfiguration">
>  value="2"/>
>
> Data Storage configuration:
>  class="org.apache.ignite.configuration.DataStorageConfiguration">
>  class="org.apache.ignite.configuration.DataRegionConfiguration">
>  value="#{1L * 1024 * 1024 * 1024}"/>
>  value="RANDOM_2_LRU"/>
>  value="/data1/data/datastore"/>
>  value="/data2/data/wal/archive"/>
>
> Cache configuration:
>  value="TRANSACTIONAL_SNAPSHOT"/>
>
> Please find the logs (FINE level) and the JVM crash log attached.
>
> hs_err_pid6456.log
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2241/hs_err_pid6456.log>
>
> ignite-8393e373.log
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2241/ignite-8393e373.log>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


Ignite Node crashed in middle of checkpoint and data loss

2019-02-18 Thread garima.j
Hello,

We have an Ignite cluster of 3 nodes (16GB RAM, 50GB disk space each) and
have given 10GB (off-heap) to the data region and a 2GB (Xms) / 3GB (Xmx)
heap to the nodes.

One node went down, and while restarting it I get an exception saying that
the Ignite node crashed in the middle of a checkpoint, and the JVM crashes after that.
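A quick back-of-the-envelope check of the figures above shows how little RAM may be left for the OS. Below is a minimal sketch under several assumptions that do not come from the post:

- The checkpoint page buffer is 2 GiB, the default that recent Ignite documentation describes for regions of 8 GB or more.
- The buffer is allocated on top of the data region.
- Other JVM memory (metaspace, thread stacks, direct buffers) takes 0.5 GiB.
- "16GB" means 16 GiB.

```java
import java.util.Locale;

final class MemoryBudget {
    static final long GIB = 1024L * 1024 * 1024;

    /**
     * Bytes left for the kernel, page cache and other processes once the node's
     * main consumers are counted; a negative result means the host is overcommitted.
     */
    static long headroomBytes(long physicalRam, long maxHeap, long dataRegionMaxSize,
                              long checkpointBuffer, long otherJvmOverhead) {
        return physicalRam - (maxHeap + dataRegionMaxSize + checkpointBuffer + otherJvmOverhead);
    }

    public static void main(String[] args) {
        // From the post: 16 GB RAM, -Xmx3g (Xmx 3GB), 10 GB off-heap data region.
        // Assumed: 2 GiB checkpoint buffer, 0.5 GiB other JVM overhead.
        long left = headroomBytes(16 * GIB, 3 * GIB, 10 * GIB, 2 * GIB, GIB / 2);
        System.out.println(String.format(Locale.ROOT, "Headroom: %.1f GiB", left / (double) GIB));
        // prints Headroom: 0.5 GiB
    }
}
```

With only about half a gigabyte to spare, a short spike in native memory could plausibly cause swapping or a native allocation failure, which the JVM reports in an hs_err file. Whether that happened here would have to be confirmed from the attached hs_err log.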

Ignite configuration:

Data Storage configuration:

Cache configuration:

Please find the logs (FINE level) and the JVM crash log attached.

hs_err_pid6456.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2241/hs_err_pid6456.log>  
ignite-8393e373.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2241/ignite-8393e373.log> 
 





Re: Activation: slow and: Ignite node crashed in the middle of checkpoint.

2017-08-21 Thread dkarachentsev
Hi,

Please share the full logs and thread dumps; without them it's hard to understand the root
cause.

Thanks!
-Dmitry.



--
View this message in context: 
http://apache-ignite-users.70518.x6.nabble.com/Activation-slow-and-Ignite-node-crashed-in-the-middle-of-checkpoint-tp16144p16341.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.


Re: Activation: slow and: Ignite node crashed in the middle of checkpoint.

2017-08-20 Thread iostream
Hi Roger,

I have experienced a similar issue during cluster activation in my setup as
well. I shared my logs here:
http://apache-ignite-users.70518.x6.nabble.com/Activating-Cluster-taking-too-long-td16093.html

Eagerly seeking a root cause and resolution for this.





RE: Activation: slow and: Ignite node crashed in the middle of checkpoint.

2017-08-15 Thread Roger Fischer (CW)
Hi Dmitry and Alex,

The cache contains 19.2M objects. The work/db directories on the 3 nodes are 23, 26 and 22 GB,
respectively. The 3 nodes have 8 GB RAM each.

I initiated deactivation at 14:13:39. As of 16:50:00, deactivation has not
completed. Only server node 2 continues to log warnings.
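To see which node keeps logging warnings, and from which component, it can help to filter each node's log by level. Below is a minimal sketch that assumes the bracketed `[time][LEVEL][thread][category] message` prefix visible in the logs in this message. The class name is hypothetical. Wrapped continuation lines are skipped rather than re-joined.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IgniteLogLine {
    // "[HH:mm:ss,SSS][LEVEL][thread][category] message", as in the logs in this thread.
    private static final Pattern PREFIX = Pattern.compile(
            "^\\[(\\d{2}:\\d{2}:\\d{2},\\d{3})\\]\\[([A-Z]+)\\]\\[([^\\]]*)\\]\\[([^\\]]*)\\]\\s*(.*)$");

    final String time;
    final String level;
    final String thread;
    final String category;
    final String message;

    private IgniteLogLine(String time, String level, String thread, String category, String message) {
        this.time = time;
        this.level = level;
        this.thread = thread;
        this.category = category;
        this.message = message;
    }

    /** Parses one line; returns null for wrapped continuation lines without the prefix. */
    static IgniteLogLine parse(String line) {
        Matcher m = PREFIX.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new IgniteLogLine(m.group(1), m.group(2), m.group(3), m.group(4), m.group(5));
    }

    /** Keeps the parsed entries whose level equals the given one (e.g. "WARNING"). */
    static List<IgniteLogLine> withLevel(List<String> lines, String level) {
        List<IgniteLogLine> out = new ArrayList<>();
        for (String line : lines) {
            IgniteLogLine parsed = parse(line);
            if (parsed != null && parsed.level.equals(level)) {
                out.add(parsed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
                "[14:13:39,473][INFO][main][GridClusterStateProcessor] Sending deactivate",
                "[14:13:50,399][WARNING][exchange-worker-#96%null%][diagnostic] Failed to wait",
                "for partition map exchange [topVer=AffinityTopologyVersion [topVer=12,");
        for (IgniteLogLine l : withLevel(sample, "WARNING")) {
            System.out.println(l.time + " " + l.category + ": " + l.message);
            // prints 14:13:50,399 diagnostic: Failed to wait
        }
    }
}
```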



The client shows the following logs:

[14:13:39,473][INFO][main][GridClusterStateProcessor] Sending deactivate 
request from node [id=548f4233-67e9-4043-aa3a-086fb541c427, 
topVer=AffinityTopologyVersion [topVer=12, minorTopVer=0], client=true, 
daemonfalse]
[14:13:40,369][INFO][tcp-client-disco-msg-worker-#4%null%][GridClusterStateProcessor]
 Start state transition: false
[14:13:40,395][INFO][exchange-worker-#96%null%][time] Started exchange init 
[topVer=AffinityTopologyVersion [topVer=12, minorTopVer=1], crd=false, evt=18, 
node=TcpDiscoveryNode [id=548f4233-67e9-4043-aa3a-086fb541c427, 
addrs=[0:0:0:0:0:0:0:1%lo, 10.24.51.187, 127.0.0.1, 
2620:100:0:fe07:ed4c:b7b8:f80c:9bef%enp0s3], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, 
/127.0.0.1:0, /2620:100:0:fe07:ed4c:b7b8:f80c:9bef%enp0s3:0, 
rfische-2.englab.brocade.com/10.24.51.187:0], discPort=0, order=11, intOrder=0, 
lastExchangeTime=1502830168939, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, 
isClient=true], evtNode=TcpDiscoveryNode 
[id=548f4233-67e9-4043-aa3a-086fb541c427, addrs=[0:0:0:0:0:0:0:1%lo, 
10.24.51.187, 127.0.0.1, 2620:100:0:fe07:ed4c:b7b8:f80c:9bef%enp0s3], 
sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, 
/2620:100:0:fe07:ed4c:b7b8:f80c:9bef%enp0s3:0, 
rfische-2.englab.brocade.com/10.24.51.187:0], discPort=0, order=11, intOrder=0, 
lastExchangeTime=1502830168939, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, 
isClient=true], customEvt=ChangeGlobalStateMessage 
[id=c90edf2ed51-d51246ce-e4d1-46f7-b156-f1ceac90bb7a, 
reqId=5505420d-5d31-4d2c-b0ae-fa7a77629d2d, 
initiatingNodeId=bda65979-33d1-4d6f-8a32-45b71255f835, activate=false]]
[14:13:40,396][INFO][exchange-worker-#96%null%][GridDhtPartitionsExchangeFuture]
 Start deactivation process [nodeId=548f4233-67e9-4043-aa3a-086fb541c427, 
client=true, topVer=AffinityTopologyVersion [topVer=12, minorTopVer=1]]
[14:13:40,397][INFO][exchange-worker-#96%null%][GridDhtPartitionsExchangeFuture]
 Successfully deactivated data structures, services and caches 
[nodeId=548f4233-67e9-4043-aa3a-086fb541c427, client=true, 
topVer=AffinityTopologyVersion [topVer=12, minorTopVer=1]]
[14:13:40,398][INFO][exchange-worker-#96%null%][GridDhtPartitionsExchangeFuture]
 Snapshot initialization completed [topVer=AffinityTopologyVersion [topVer=12, 
minorTopVer=1], time=0ms]
[14:13:40,398][INFO][exchange-worker-#96%null%][time] Finished exchange init 
[topVer=AffinityTopologyVersion [topVer=12, minorTopVer=1], crd=false]
[14:13:41,173][INFO][tcp-client-disco-msg-worker-#4%null%][GridClusterStateProcessor]
 Received state change finish message: false
[14:13:45,355][INFO][grid-timeout-worker-#15%null%][IgniteKernal]
Metrics for local node (to disable set 'metricsLogFrequency' to 0)
^-- Node [id=548f4233, name=null, uptime=00:24:07:982]
^-- H/N/C [hosts=3, nodes=4, CPUs=12]
^-- CPU [cur=0.6%, avg=0.89%, GC=0%]
^-- PageMemory [pages=0]
^-- Heap [used=248MB, free=86.36%, comm=951MB]
^-- Non heap [used=48MB, free=-1%, comm=49MB]
^-- Public thread pool [active=0, idle=0, qSize=0]
^-- System thread pool [active=0, idle=0, qSize=0]
^-- Outbound messages queue [size=0]
[14:13:50,399][WARNING][exchange-worker-#96%null%][diagnostic] Failed to wait 
for partition map exchange [topVer=AffinityTopologyVersion [topVer=12, 
minorTopVer=1], node=548f4233-67e9-4043-aa3a-086fb541c427]. Dumping pending 
objects that might be the cause:
[14:13:50,400][WARNING][exchange-worker-#96%null%][diagnostic] Ready affinity 
version: AffinityTopologyVersion [topVer=12, minorTopVer=0]
[14:13:50,624][WARNING][exchange-worker-#96%null%][diagnostic] Last exchange 
future: GridDhtPartitionsExchangeFuture [dummy=false, forcePreload=false, 
reassign=false, discoEvt=DiscoveryCustomEvent 
[customMsg=ChangeGlobalStateMessage 
[id=c90edf2ed51-d51246ce-e4d1-46f7-b156-f1ceac90bb7a, 
reqId=5505420d-5d31-4d2c-b0ae-fa7a77629d2d, 
initiatingNodeId=bda65979-33d1-4d6f-8a32-45b71255f835, activate=false], 
affTopVer=AffinityTopologyVersion [topVer=12, minorTopVer=1], 
super=DiscoveryEvent [evtNode=TcpDiscoveryNode 
[id=bda65979-33d1-4d6f-8a32-45b71255f835, addrs=[0:0:0:0:0:0:0:1%lo, 
10.24.51.190, 127.0.0.1, 2620:100:0:fe07:f92c:9dbd:9b0f:9982%enp0s3], 
sockAddrs=[/2620:100:0:fe07:f92c:9dbd:9b0f:9982%enp0s3:47500, 
rfische-1.englab.brocade.com/10.24.51.190:47500, /0:0:0:0:0:0:0:1%lo:47500, 
/127.0.0.1:47500], discPort=47500, order=1, intOrder=1, 
lastExchangeTime=1502831381649, loc=false, ver=2.1.0#20170720-sha1:a6ca5c8a, 
isClient=false], topVer=12, nodeId8=548f4233, msg=null, 
type=DISCOVERY_CUSTOM_EVT, tstamp=1502831620392]], crd=TcpDiscoveryNode 
[id=bda65979-33d1-4d6f-8a32-45b71255f835, addrs=[0:0:0:0:0:0:0:1%lo, 
10.24.51.190, 127.0.0.1, 2620:100:0:fe07:f92c:9dbd:9b0f:9

RE: Activation: slow and: Ignite node crashed in the middle of checkpoint.

2017-08-15 Thread dkarachentsev
Hi Roger,

The recovery message in the logs is normal when a node was forcibly stopped.
It only means that data is being restored from the WAL on start.

Slow activation doesn't look right; it shouldn't take that long. Could you please
restart the grid with the -DIGNITE_QUIET=false JVM flag and share the logs?
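The point above about restoring data from the WAL after a forced stop can be illustrated with a deliberately simplified model. This is a toy, not Ignite's recovery code, and the record types and names are hypothetical.

In the model, a checkpoint is bracketed by start and end markers. If the newest start marker has no matching end, the page files may be half-written. Recovery therefore takes the on-disk pages and re-applies every update logged after the start of the last completed checkpoint. Updates carry whole page values, which makes replaying them more than once harmless.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ToyRecovery {
    enum Kind { UPDATE, CP_START, CP_END }

    /** One log record: a full-page update, or a checkpoint start/end marker. */
    static final class Rec {
        final Kind kind;
        final int page;
        final String value;

        private Rec(Kind kind, int page, String value) {
            this.kind = kind;
            this.page = page;
            this.value = value;
        }

        static Rec update(int page, String value) { return new Rec(Kind.UPDATE, page, value); }
        static Rec cpStart() { return new Rec(Kind.CP_START, -1, null); }
        static Rec cpEnd() { return new Rec(Kind.CP_END, -1, null); }
    }

    /** True if the newest checkpoint marker is a start without a matching end. */
    static boolean crashedMidCheckpoint(List<Rec> log) {
        for (int i = log.size() - 1; i >= 0; i--) {
            if (log.get(i).kind == Kind.CP_END) return false;
            if (log.get(i).kind == Kind.CP_START) return true;
        }
        return false;
    }

    /** Rebuilds page contents from possibly half-written disk pages plus redo of the log. */
    static Map<Integer, String> recover(Map<Integer, String> diskPages, List<Rec> log) {
        int redoFrom = 0; // with no completed checkpoint, redo the whole log
        int lastStart = -1;
        for (int i = 0; i < log.size(); i++) {
            Kind k = log.get(i).kind;
            if (k == Kind.CP_START) {
                lastStart = i;
            } else if (k == Kind.CP_END && lastStart >= 0) {
                redoFrom = lastStart + 1; // disk holds everything logged before this start
            }
        }
        Map<Integer, String> pages = new TreeMap<>(diskPages);
        for (int i = redoFrom; i < log.size(); i++) {
            Rec r = log.get(i);
            if (r.kind == Kind.UPDATE) {
                pages.put(r.page, r.value); // whole-page value, so replaying it twice is harmless
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        List<Rec> log = new ArrayList<>();
        log.add(Rec.update(1, "a"));
        log.add(Rec.cpStart());
        log.add(Rec.cpEnd());      // checkpoint 1 completed: disk has page 1 = a
        log.add(Rec.update(1, "b"));
        log.add(Rec.update(2, "x"));
        log.add(Rec.cpStart());    // checkpoint 2 started, then the process died
        Map<Integer, String> disk = new HashMap<>();
        disk.put(1, "b");          // checkpoint 2 wrote page 1 but not page 2
        System.out.println(crashedMidCheckpoint(log)); // prints true
        System.out.println(recover(disk, log));        // prints {1=b, 2=x}
    }
}
```

In this model, a node that is killed during any checkpoint, even one with almost nothing to write, leaves a start marker without an end marker. That is why such a message alone does not imply data loss.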

Thanks!
-Dmitry.





Re: RE: Activation: slow and: Ignite node crashed in the middle of checkpoint.

2017-08-15 Thread Alexey Kukushkin
I just saw this "Ignite node crashed in the middle of checkpoint" message on my
development machine with the latest Ignite 2.1.4: it appeared when activating
a single-node cluster with persistence enabled and no data to preload at all. I
will also look into it after I complete my current tasks.
Best regards, Alexey


On Tuesday, August 15, 2017, 3:39:57 AM GMT+3, Roger Fischer (CW) wrote:

Hi Alex,

there were no relevant logs other than what I already listed in the first email.

All 3 servers (and the client) are on VMs on the same host. No network hop 
latency. But all 3 VMs use the same physical disk (o

RE: Activation: slow and: Ignite node crashed in the middle of checkpoint.

2017-08-14 Thread Roger Fischer (CW)
Hi Alex,

there were no relevant logs other than what I already listed in the first email.

http://www.springframework.org/schema/beans"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="
   http://www.springframework.org/schema/beans
   http://www.springframework.org/schema/beans/spring-beans.xsd">

10.24.51.190
10.24.51.187
10.24.51.150

dateTime
portId

portId
dateTime

switchId
dateTime

All 3 servers (and the client) are on VMs on the same host. No network hop 
latency. But all 3 VMs use the same physical disk (on the host).

Servers have 16 GB of RAM. Data on disk (work/db) was about 35 GB per node.
About 36M objects.

Please also note
http://apache-ignite-users.70518.x6.nabble.com/Strange-problems-with-Ignite-native-Persistence-when-Data-exceeds-Memory-td16187.html.
There were some odd problems at the time that may have affected the activation.

Roger


From: afedotov [mailto:alexander.fedot...@gmail.com]
Sent: Monday, August 14, 2017 11:05 AM
To: user@ignite.apache.org
Subject: Re: Activation: slow and: Ignite node crashed in the middle of 
checkpoint.

Hi,

Could you please share the logs and configuration?
Actually, the activation time depends on the amount of data, network 
connectivity and other variables.

Kind regards,
Alex.

On Sat, Aug 12, 2017 at 12:39 AM, Roger Fischer (CW) [via Apache Ignite Users] wrote:
Hello,

I am wondering if the following behavior is typical, or if it represents a 
concern.

I have a 3-node cluster with native persistence. Each node has 4 CPUs and 16 GB
of RAM.
Each node has ~45 GB used in work/db. Total across the 3 nodes is about 36.5 M 
objects.
I am using SQL queries, and there are 3 indexes.

The servers start up normally and join the cluster, as expected.

When I start the client, which calls active(), all 3 servers report the 
following:

[12:41:28] Topology snapshot [ver=5, servers=3, clients=1, CPUs=16, heap=4.8GB]
[12:41:29] Default checkpoint page buffer size is too small, setting to an 
adjusted value: 2.0 GiB
[12:41:29] Ignite node c

Re: Activation: slow and: Ignite node crashed in the middle of checkpoint.

2017-08-14 Thread afedotov
Hi,

Could you please share the logs and configuration?
Actually, the activation time depends on the amount of data, network
connectivity and other variables.


Kind regards,
Alex.

On Sat, Aug 12, 2017 at 12:39 AM, Roger Fischer (CW) [via Apache Ignite
Users]  wrote:

> Hello,
>
> I am wondering if the following behavior is typical, or if it represents a
> concern.
>
> I have a 3-node cluster with native persistence. Each node has 4 CPUs and 16
> GB of RAM.
> Each node has ~45 GB used in work/db. Total across the 3 nodes is about
> 36.5 M objects.
> I am using SQL queries, and there are 3 indexes.
>
> The servers start up normally and join the cluster, as expected.
>
> When I start the client, which calls active(), all 3 servers report the
> following:
>
> [12:41:28] Topology snapshot [ver=5, servers=3, clients=1, CPUs=16,
> heap=4.8GB]
> [12:41:29] Default checkpoint page buffer size is too small, setting to an
> adjusted value: 2.0 GiB
> [12:41:29] Ignite node crashed in the middle of checkpoint. Will restore
> memory state and enforce checkpoint on node start.
>
> 1) Should I worry about the “crashed” log?
>
> The activation takes more than 30 minutes (until active() returns).
>
> 2) Is it normal for activation to take that long?
>
> ver. 2.1.0#20170720-sha1:a6ca5c8a
> OS: Linux 3.10.0-514.el7.x86_64 amd64
>
> Thanks…
>
> Roger
>





Activation: slow and: Ignite node crashed in the middle of checkpoint.

2017-08-11 Thread Roger Fischer (CW)
Hello,

I am wondering if the following behavior is typical, or if it represents a 
concern.

I have a 3-node cluster with native persistence. Each node has 4 CPUs and 16 GB
of RAM.
Each node has ~45 GB used in work/db. Total across the 3 nodes is about 36.5 M 
objects.
I am using SQL queries, and there are 3 indexes.

The servers start up normally and join the cluster, as expected.

When I start the client, which calls active(), all 3 servers report the 
following:

[12:41:28] Topology snapshot [ver=5, servers=3, clients=1, CPUs=16, heap=4.8GB]
[12:41:29] Default checkpoint page buffer size is too small, setting to an 
adjusted value: 2.0 GiB
[12:41:29] Ignite node crashed in the middle of checkpoint. Will restore memory 
state and enforce checkpoint on node start.
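The 2.0 GiB in the log line above is consistent with the default checkpoint page buffer sizing that the Ignite persistence-tuning documentation describes for newer versions. That rule gives 256 MB (or the region size, if smaller) for regions under 1 GB, a quarter of the region size between 1 GB and 8 GB, and 2 GB above that. Below is a minimal sketch of that rule. The thresholds are recalled from the docs and may differ in 2.1.0, so treat them as an assumption and check the documentation for your version.

```java
final class CheckpointBuffer {
    static final long MIB = 1024L * 1024;
    static final long GIB = 1024 * MIB;

    /** Default checkpoint page buffer size for a data region of the given max size (assumed thresholds). */
    static long defaultSize(long regionMaxSize) {
        if (regionMaxSize < GIB) {
            return Math.min(256 * MIB, regionMaxSize);
        }
        if (regionMaxSize < 8 * GIB) {
            return regionMaxSize / 4;
        }
        return 2 * GIB;
    }

    public static void main(String[] args) {
        long[] regions = {512 * MIB, 4 * GIB, 12 * GIB};
        for (long r : regions) {
            // prints 512 -> 256, 4096 -> 1024, 12288 -> 2048 (MiB)
            System.out.println(r / MIB + " MiB region -> " + defaultSize(r) / MIB + " MiB buffer");
        }
    }
}
```

In recent versions the buffer can also be set explicitly with DataRegionConfiguration#setCheckpointPageBufferSize, which avoids relying on the adjusted default.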

1) Should I worry about the "crashed" log?

The activation takes more than 30 minutes (until active() returns).

2) Is it normal for activation to take that long?

ver. 2.1.0#20170720-sha1:a6ca5c8a
OS: Linux 3.10.0-514.el7.x86_64 amd64

Thanks...

Roger