Re: Ignite node crashed
Hello Marcus,

It seems the problem really is due to a long GC pause. Did you apply all of the suggestions from [1] and [2]?

[1] https://apacheignite.readme.io/docs/jvm-and-system-tuning
[2] https://apacheignite.readme.io/docs/jvm-and-system-tuning#memory-issues

> Hi,
>
> I have a 5 node Ignite cluster setup, and it seems that when I start to create
> tables in the cluster, one of the nodes crashes. All of the nodes are VMs
> with 8 CPUs and 128 GB of memory. I have attached the log file, GC file and
> also the XML config for the crashing node (with a default data region of 90 GB,
> and a heap size of 10 GB). I can see the node having a long GC starting from
> 04:29:58, but unfortunately the GC log doesn't show anything at that time. Can
> you please shed some light on the issue? Thanks.
>
> Regards,
> Marcus
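For reference, the tuning page cited above recommends JVM options along these lines. A minimal sketch that just assembles and prints such an option list — the exact flags and the 10g heap size are illustrative (matching the heap mentioned in this thread), and should be verified against the linked guide for your workload:

```java
import java.util.List;

public class JvmTuningOptions {
    // Options in the spirit of the Ignite "JVM and System Tuning" page;
    // flags and sizes here are illustrative, not a definitive recipe.
    static List<String> options() {
        return List.of(
            "-Xms10g", "-Xmx10g",          // fixed heap size avoids resize pauses
            "-XX:+UseG1GC",                // low-pause collector
            "-XX:+ScavengeBeforeFullGC",   // run a young GC before any full GC
            "-XX:+DisableExplicitGC",      // ignore System.gc() calls
            "-XX:+AlwaysPreTouch"          // touch heap pages at startup
        );
    }

    public static void main(String[] args) {
        options().forEach(System.out::println);
    }
}
```

Pinning -Xms to -Xmx and pre-touching pages trades startup time for fewer surprises at runtime, which matters on nodes with large heaps like this one.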
Re: Ignite Node crashed in middle of checkpoint and data loss
Hello!

Can you share your data files (WAL and db) so that we can try to reproduce the crash? If that is not feasible, my recommendation is to try to bring this data up by starting with a Nightly Build instead of 2.7: https://ignite.apache.org/download.cgi#nightly-builds

Regards,
--
Ilya Kasnacheev

Tue, 19 Feb 2019 at 09:03, garima.j:

> Hello,
>
> We have an Ignite cluster of 3 nodes (16 GB RAM, 50 GB disk space each) and
> have given 10 GB (off-heap) to the data region, and 2 GB (Xms) and 3 GB (Xmx)
> to the nodes.
>
> One node went down, and while restarting the node I get the exception that the
> Ignite node crashed in the middle of a checkpoint, with a JVM crash after that.
>
> Ignite configuration :
>
> class="org.apache.ignite.configuration.TransactionConfiguration">
> value="2"/>
>
> Data Storage configuration :
>
> class="org.apache.ignite.configuration.DataStorageConfiguration">
> class="org.apache.ignite.configuration.DataRegionConfiguration">
> value="#{1L * 1024 * 1024 * 1024}"/>
> value="RANDOM_2_LRU"/>
> value="/data1/data/datastore"/>
> value="/data2/data/wal/archive"/>
>
> Cache configuration :
>
> value="TRANSACTIONAL_SNAPSHOT"/>
>
> Please find the logs (FINE level) and JVM crash logs.
>
> hs_err_pid6456.log
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2241/hs_err_pid6456.log>
>
> ignite-8393e373.log
> <http://apache-ignite-users.70518.x6.nabble.com/file/t2241/ignite-8393e373.log>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
Ignite Node crashed in middle of checkpoint and data loss
Hello,

We have an Ignite cluster of 3 nodes (16 GB RAM, 50 GB disk space each) and have given 10 GB (off-heap) to the data region, and 2 GB (Xms) and 3 GB (Xmx) to the nodes.

One node went down, and while restarting the node I get the exception that the Ignite node crashed in the middle of a checkpoint, with a JVM crash after that.

Ignite configuration :
Data Storage configuration :
Cache configuration :

Please find the logs (FINE level) and JVM crash logs.

hs_err_pid6456.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2241/hs_err_pid6456.log>

ignite-8393e373.log
<http://apache-ignite-users.70518.x6.nabble.com/file/t2241/ignite-8393e373.log>

--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/
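As a sanity check on the sizing in this post: a 10 GiB data region plus a 3 GiB heap already commits most of a 16 GiB node, before counting the checkpoint page buffer and JVM/OS overhead. A back-of-the-envelope sketch — the 1 GiB checkpoint buffer and the ~1 GiB JVM/OS overhead figures are my own assumptions, not values confirmed by the post:

```java
public class MemoryBudget {
    // Rough per-node footprint in GiB: heap + off-heap data region
    // + checkpoint page buffer + JVM/OS overhead.
    static double footprintGib(double heapGib, double dataRegionGib,
                               double checkpointBufGib, double overheadGib) {
        return heapGib + dataRegionGib + checkpointBufGib + overheadGib;
    }

    public static void main(String[] args) {
        // Xmx = 3 GiB and the 10 GiB region come from the post; the
        // 1 GiB checkpoint buffer and ~1 GiB overhead are assumptions.
        double total = footprintGib(3, 10, 1, 1);
        System.out.printf("~%.0f GiB of 16 GiB RAM committed%n", total);
    }
}
```

With ~15 of 16 GiB committed, almost nothing is left for the OS page cache, which Ignite native persistence leans on heavily; memory pressure like this is a common contributor to crashes under checkpoint load.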
Re: Activation: slow and: Ignite node crashed in the middle of checkpoint.
Hi,

Please share full logs and thread dumps; without them it's hard to understand the root cause.

Thanks!
-Dmitry.

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Activation-slow-and-Ignite-node-crashed-in-the-middle-of-checkpoint-tp16144p16341.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
Re: Activation: slow and: Ignite node crashed in the middle of checkpoint.
Hi Roger,

I have experienced a similar issue during cluster activation in my setup as well. I shared my logs here:
http://apache-ignite-users.70518.x6.nabble.com/Activating-Cluster-taking-too-long-td16093.html

Eagerly seeking a root cause and resolution for this.

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Activation-slow-and-Ignite-node-crashed-in-the-middle-of-checkpoint-tp16144p16318.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
RE: Activation: slow and: Ignite node crashed in the middle of checkpoint.
Hi Dmitry and Alex,

the cache contains 19.2M objects. The work/db directory is 23, 26 and 22 GB respectively. The 3 nodes have 8 GB RAM each.

I initiated deactivate at 14:13:39. As of 16:50:00, deactivate has not completed. Only server node 2 continues to log warnings. The client shows the following logs:

[14:13:39,473][INFO][main][GridClusterStateProcessor] Sending deactivate request from node [id=548f4233-67e9-4043-aa3a-086fb541c427, topVer=AffinityTopologyVersion [topVer=12, minorTopVer=0], client=true, daemon=false]
[14:13:40,369][INFO][tcp-client-disco-msg-worker-#4%null%][GridClusterStateProcessor] Start state transition: false
[14:13:40,395][INFO][exchange-worker-#96%null%][time] Started exchange init [topVer=AffinityTopologyVersion [topVer=12, minorTopVer=1], crd=false, evt=18, node=TcpDiscoveryNode [id=548f4233-67e9-4043-aa3a-086fb541c427, addrs=[0:0:0:0:0:0:0:1%lo, 10.24.51.187, 127.0.0.1, 2620:100:0:fe07:ed4c:b7b8:f80c:9bef%enp0s3], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /2620:100:0:fe07:ed4c:b7b8:f80c:9bef%enp0s3:0, rfische-2.englab.brocade.com/10.24.51.187:0], discPort=0, order=11, intOrder=0, lastExchangeTime=1502830168939, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=true], evtNode=TcpDiscoveryNode [id=548f4233-67e9-4043-aa3a-086fb541c427, addrs=[0:0:0:0:0:0:0:1%lo, 10.24.51.187, 127.0.0.1, 2620:100:0:fe07:ed4c:b7b8:f80c:9bef%enp0s3], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /2620:100:0:fe07:ed4c:b7b8:f80c:9bef%enp0s3:0, rfische-2.englab.brocade.com/10.24.51.187:0], discPort=0, order=11, intOrder=0, lastExchangeTime=1502830168939, loc=true, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=true], customEvt=ChangeGlobalStateMessage [id=c90edf2ed51-d51246ce-e4d1-46f7-b156-f1ceac90bb7a, reqId=5505420d-5d31-4d2c-b0ae-fa7a77629d2d, initiatingNodeId=bda65979-33d1-4d6f-8a32-45b71255f835, activate=false]]
[14:13:40,396][INFO][exchange-worker-#96%null%][GridDhtPartitionsExchangeFuture] Start deactivation process
[nodeId=548f4233-67e9-4043-aa3a-086fb541c427, client=true, topVer=AffinityTopologyVersion [topVer=12, minorTopVer=1]]
[14:13:40,397][INFO][exchange-worker-#96%null%][GridDhtPartitionsExchangeFuture] Successfully deactivated data structures, services and caches [nodeId=548f4233-67e9-4043-aa3a-086fb541c427, client=true, topVer=AffinityTopologyVersion [topVer=12, minorTopVer=1]]
[14:13:40,398][INFO][exchange-worker-#96%null%][GridDhtPartitionsExchangeFuture] Snapshot initialization completed [topVer=AffinityTopologyVersion [topVer=12, minorTopVer=1], time=0ms]
[14:13:40,398][INFO][exchange-worker-#96%null%][time] Finished exchange init [topVer=AffinityTopologyVersion [topVer=12, minorTopVer=1], crd=false]
[14:13:41,173][INFO][tcp-client-disco-msg-worker-#4%null%][GridClusterStateProcessor] Received state change finish message: false
[14:13:45,355][INFO][grid-timeout-worker-#15%null%][IgniteKernal] Metrics for local node (to disable set 'metricsLogFrequency' to 0)
    ^-- Node [id=548f4233, name=null, uptime=00:24:07:982]
    ^-- H/N/C [hosts=3, nodes=4, CPUs=12]
    ^-- CPU [cur=0.6%, avg=0.89%, GC=0%]
    ^-- PageMemory [pages=0]
    ^-- Heap [used=248MB, free=86.36%, comm=951MB]
    ^-- Non heap [used=48MB, free=-1%, comm=49MB]
    ^-- Public thread pool [active=0, idle=0, qSize=0]
    ^-- System thread pool [active=0, idle=0, qSize=0]
    ^-- Outbound messages queue [size=0]
[14:13:50,399][WARNING][exchange-worker-#96%null%][diagnostic] Failed to wait for partition map exchange [topVer=AffinityTopologyVersion [topVer=12, minorTopVer=1], node=548f4233-67e9-4043-aa3a-086fb541c427].
Dumping pending objects that might be the cause:
[14:13:50,400][WARNING][exchange-worker-#96%null%][diagnostic] Ready affinity version: AffinityTopologyVersion [topVer=12, minorTopVer=0]
[14:13:50,624][WARNING][exchange-worker-#96%null%][diagnostic] Last exchange future: GridDhtPartitionsExchangeFuture [dummy=false, forcePreload=false, reassign=false, discoEvt=DiscoveryCustomEvent [customMsg=ChangeGlobalStateMessage [id=c90edf2ed51-d51246ce-e4d1-46f7-b156-f1ceac90bb7a, reqId=5505420d-5d31-4d2c-b0ae-fa7a77629d2d, initiatingNodeId=bda65979-33d1-4d6f-8a32-45b71255f835, activate=false], affTopVer=AffinityTopologyVersion [topVer=12, minorTopVer=1], super=DiscoveryEvent [evtNode=TcpDiscoveryNode [id=bda65979-33d1-4d6f-8a32-45b71255f835, addrs=[0:0:0:0:0:0:0:1%lo, 10.24.51.190, 127.0.0.1, 2620:100:0:fe07:f92c:9dbd:9b0f:9982%enp0s3], sockAddrs=[/2620:100:0:fe07:f92c:9dbd:9b0f:9982%enp0s3:47500, rfische-1.englab.brocade.com/10.24.51.190:47500, /0:0:0:0:0:0:0:1%lo:47500, /127.0.0.1:47500], discPort=47500, order=1, intOrder=1, lastExchangeTime=1502831381649, loc=false, ver=2.1.0#20170720-sha1:a6ca5c8a, isClient=false], topVer=12, nodeId8=548f4233, msg=null, type=DISCOVERY_CUSTOM_EVT, tstamp=1502831620392]], crd=TcpDiscoveryNode [id=bda65979-33d1-4d6f-8a32-45b71255f835, addrs=[0:0:0:0:0:0:0:1%lo, 10.24.51.190, 127.0.0.1, 2620:100:0:fe07:f92c:9dbd:9b0f:9
RE: Activation: slow and: Ignite node crashed in the middle of checkpoint.
Hi Roger,

The recovery message in the logs is the normal case when a node was forced to stop. It only means that data is being restored from the WAL on start.

The slow activation doesn't look OK; it shouldn't take that long. Could you please restart the grid with the -DIGNITE_QUIET=false JVM flag and share the logs?

Thanks!
-Dmitry.

--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Activation-slow-and-Ignite-node-crashed-in-the-middle-of-checkpoint-tp16144p16197.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
Re: RE: Activation: slow and: Ignite node crashed in the middle of checkpoint.
I just saw this "Ignite node crashed in the middle of checkpoint" on my development machine with the latest Ignite 2.1.4 - it appeared when activating a single-node cluster with persistence enabled but no data to preload at all. I will also look into it after I complete my current tasks.

Best regards,
Alexey

On Tuesday, August 15, 2017, 3:39:57 AM GMT+3, Roger Fischer (CW) wrote:
RE: Activation: slow and: Ignite node crashed in the middle of checkpoint.
Hi Alex,

there were no other relevant logs than what I already listed in the first email.

http://www.springframework.org/schema/beans"; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"; xsi:schemaLocation=" http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd";>
10.24.51.190 10.24.51.187 10.24.51.150
dateTime portId portId dateTime switchId dateTime

All 3 servers (and the client) are on VMs on the same host, so there is no network hop latency, but all 3 VMs use the same physical disk (on the host). Servers have 16 GB of RAM. Data on disk (work/db) was about 35 GB per node. About 36M objects.

Please also note http://apache-ignite-users.70518.x6.nabble.com/Strange-problems-with-Ignite-native-Persistence-when-Data-exceeds-Memory-td16187.html. There were some odd problems at the time that may have affected the activation.

Roger

From: afedotov [mailto:alexander.fedot...@gmail.com]
Sent: Monday, August 14, 2017 11:05 AM
To: user@ignite.apache.org
Subject: Re: Activation: slow and: Ignite node crashed in the middle of checkpoint.

Hi,

Could you please share the logs and configuration? Actually, the activation time depends on the amount of data, network connectivity and other variables.

Kind regards,
Alex.

On Sat, Aug 12, 2017 at 12:39 AM, Roger Fischer (CW) [via Apache Ignite Users] <[hidden email]> wrote:

Hello,

I am wondering if the following behavior is typical, or if it represents a concern.

I have a 3 node cluster with native persistence. Each node has 4 CPUs and 16 GB of RAM. Each node has ~45 GB used in work/db. Total across the 3 nodes is about 36.5 M objects. I am using SQL queries, and there are 3 indexes.

The servers start up normally and join the cluster, as expected.

When I start the client, which calls active(), all 3 servers report the following:

[12:41:28] Topology snapshot [ver=5, servers=3, clients=1, CPUs=16, heap=4.8GB]
[12:41:29] Default checkpoint page buffer size is too small, setting to an adjusted value: 2.0 GiB
[12:41:29] Ignite node c
Re: Activation: slow and: Ignite node crashed in the middle of checkpoint.
Hi,

Could you please share the logs and configuration? Actually, the activation time depends on the amount of data, network connectivity and other variables.

Kind regards,
Alex.

On Sat, Aug 12, 2017 at 12:39 AM, Roger Fischer (CW) [via Apache Ignite Users] wrote:

> Hello,
>
> I am wondering if the following behavior is typical, or if it represents a
> concern.
>
> I have a 3 node cluster with native persistence. Each node has 4 CPUs and 16
> GB of RAM.
>
> Each node has ~45 GB used in work/db. Total across the 3 nodes is about
> 36.5 M objects.
>
> I am using SQL queries, and there are 3 indexes.
>
> The servers start up normally and join the cluster, as expected.
>
> When I start the client, which calls active(), all 3 servers report the
> following:
>
> [12:41:28] Topology snapshot [ver=5, servers=3, clients=1, CPUs=16,
> heap=4.8GB]
>
> [12:41:29] Default checkpoint page buffer size is too small, setting to an
> adjusted value: 2.0 GiB
>
> [12:41:29] Ignite node crashed in the middle of checkpoint. Will restore
> memory state and enforce checkpoint on node start.
>
> 1) Should I worry about the "crashed" log?
>
> The activation takes more than 30 minutes (until active() returns).
>
> 2) Is that normal for activate to take that long?
>
> ver. 2.1.0#20170720-sha1:a6ca5c8a
>
> OS: Linux 3.10.0-514.el7.x86_64 amd64
>
> Thanks...
>
> Roger
--
View this message in context: http://apache-ignite-users.70518.x6.nabble.com/Activation-slow-and-Ignite-node-crashed-in-the-middle-of-checkpoint-tp16144p16176.html
Sent from the Apache Ignite Users mailing list archive at Nabble.com.
Activation: slow and: Ignite node crashed in the middle of checkpoint.
Hello,

I am wondering if the following behavior is typical, or if it represents a concern.

I have a 3 node cluster with native persistence. Each node has 4 CPUs and 16 GB of RAM. Each node has ~45 GB used in work/db. Total across the 3 nodes is about 36.5 M objects. I am using SQL queries, and there are 3 indexes.

The servers start up normally and join the cluster, as expected.

When I start the client, which calls active(), all 3 servers report the following:

[12:41:28] Topology snapshot [ver=5, servers=3, clients=1, CPUs=16, heap=4.8GB]
[12:41:29] Default checkpoint page buffer size is too small, setting to an adjusted value: 2.0 GiB
[12:41:29] Ignite node crashed in the middle of checkpoint. Will restore memory state and enforce checkpoint on node start.

1) Should I worry about the "crashed" log?

The activation takes more than 30 minutes (until active() returns).

2) Is that normal for activate to take that long?

ver. 2.1.0#20170720-sha1:a6ca5c8a
OS: Linux 3.10.0-514.el7.x86_64 amd64

Thanks...

Roger
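The "adjusted value: 2.0 GiB" in the log above matches how Ignite 2.x sizes the default checkpoint page buffer when none is configured. A sketch of the tiered rule as I recall it from the DataStorageConfiguration documentation (please verify against the docs for your exact version before relying on it):

```java
public class CheckpointBufferDefault {
    static final long MIB = 1L << 20, GIB = 1L << 30;

    // Default checkpoint page buffer size (my recollection of the 2.x rule):
    //   region < 1 GiB   -> min(256 MiB, region size)
    //   1 .. 8 GiB       -> region size / 4
    //   > 8 GiB          -> capped at 2 GiB
    static long defaultCheckpointBuffer(long regionSize) {
        if (regionSize < GIB)
            return Math.min(256 * MIB, regionSize);
        if (regionSize <= 8 * GIB)
            return regionSize / 4;
        return 2 * GIB;
    }

    public static void main(String[] args) {
        // A large persistent region, as in this thread, lands on the cap:
        System.out.println(defaultCheckpointBuffer(45 * GIB) / GIB + " GiB");
    }
}
```

So the "too small, setting to an adjusted value" line is Ignite applying this default rather than a sign of trouble; the buffer absorbs page writes while a checkpoint is in progress, and too small a buffer can stall writes mid-checkpoint.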