Re: Partition states validation has failed for group: CUSTOMER_KV

2021-09-11 Thread Pavel Kovalenko
Hi Naveen, I think just stopping updates is not enough to make a consistent snapshot of the partition stores. You must ensure that all updates are also checkpointed to disk. Otherwise, to restore a valid snapshot you must copy the WAL as well as the partition stores. You can try to deactivate the source

Re: Regarding Partition Map exchange Triggers

2020-12-12 Thread Pavel Kovalenko
ion-td34823.html > > Now it seems that the relevant explanations are confusing? > On 2020/12/11 8:21 PM, Pavel Kovalenko wrote: > > Hi, > > I think the information on the wiki that PME is not triggered for > some cases is wrong. It should be fixed. > Actually, PME is triggered in all ca

Re: Regarding Partition Map exchange Triggers

2020-12-11 Thread Pavel Kovalenko
Your thoughts are right. If the cache exists, no PME will be started. If it doesn't exist, the getOrCreate() method will create it and start a PME; the cache() method will throw an exception or return null (I don't remember which exactly). Fri, 11 Dec 2020 at 17:48, VeenaMithare: > HI Pavel, > > Thank you for
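The contract described above can be sketched with a plain map standing in for the cluster. This is a hedged illustration only: the real calls are Ignite's `ignite.getOrCreateCache(name)` and `ignite.cache(name)`, and only the former triggers a PME when the cache is missing; the class and cache names below are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class CacheLookupSketch {
    // Stand-in for the cluster's cache registry.
    static final Map<String, Object> caches = new HashMap<>();
    static int pmeCount = 0; // counts simulated partition map exchanges

    // Models ignite.getOrCreateCache(name): creates the cache (starting a PME) if absent.
    static Object getOrCreateCache(String name) {
        return caches.computeIfAbsent(name, n -> {
            pmeCount++; // cache creation triggers a PME
            return new Object();
        });
    }

    // Models ignite.cache(name): never creates anything, returns null if absent (no PME).
    static Object cache(String name) {
        return caches.get(name);
    }

    public static void main(String[] args) {
        if (cache("orders") != null) throw new AssertionError(); // missing cache: no PME
        Object c = getOrCreateCache("orders");                   // first call creates it: one PME
        if (pmeCount != 1) throw new AssertionError();
        if (getOrCreateCache("orders") != c) throw new AssertionError(); // exists: no new PME
        if (pmeCount != 1) throw new AssertionError();
        System.out.println("PME count: " + pmeCount);
    }
}
```

Running the sketch shows a single simulated PME, no matter how many times the existing cache is looked up afterwards.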

Re: Regarding Partition Map exchange Triggers

2020-12-11 Thread Pavel Kovalenko
Hi, I think the information on the wiki that PME is not triggered in some cases is wrong. It should be fixed. Actually, PME is triggered in all cases, but for some of them it doesn't block cache operations, or the time of blocking is minimized. Most optimizations for minimizing the blocking time of

Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

2020-05-01 Thread Pavel Kovalenko
Hello, I don't clearly understand from your message: has the exchange finally finished, or were you getting this WARN message all the time? Fri, 1 May 2020 at 12:32, Ilya Kasnacheev: > Hello! > > This description sounds like a typical hanging Partition Map Exchange, but > you should be

Re: excessive timeouts and load on new cache creations

2019-11-22 Thread Pavel Kovalenko
Hi Ibrahim, I see you have 317 cache groups in your cluster: `Full map updating for 317 groups performed in 105 ms.` Each cache group has its own partition map and affinity map, which require memory that resides in the old generation. During cache creation, a distributed PME happens and all partition and affinity
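One way to reduce the number of cache groups (and thus the per-group partition and affinity map overhead described above) is to place related caches into a single group via `groupName`. A sketch in Ignite's Spring XML configuration; the cache and group names are hypothetical:

```xml
<!-- Sketch: both caches share one cache group, so one partition/affinity map. -->
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="cacheConfiguration">
        <list>
            <bean class="org.apache.ignite.configuration.CacheConfiguration">
                <property name="name" value="orders"/>        <!-- hypothetical name -->
                <property name="groupName" value="appCaches"/>
            </bean>
            <bean class="org.apache.ignite.configuration.CacheConfiguration">
                <property name="name" value="customers"/>     <!-- hypothetical name -->
                <property name="groupName" value="appCaches"/>
            </bean>
        </list>
    </property>
</bean>
```

The trade-off is that caches in one group share partition files and rebalancing units, so grouping is best applied to caches with similar lifecycles.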

Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

2019-10-11 Thread Pavel Kovalenko
Ibrahim, I've checked logs and found the following issue: [2019-09-27T15:00:06,164][ERROR][sys-stripe-32-#33][atomic] Received message without registered handler (will ignore) [msg=GridDhtAtomicDeferredUpdateResponse [futIds=GridLongList [idx=1, arr=[6389728]]],

Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

2019-10-10 Thread Pavel Kovalenko
Ibrahim, Could you please also share the cache configuration that is used for dynamic creation? Thu, 10 Oct 2019 at 19:09, Pavel Kovalenko: > Hi Ibrahim, > > I see that one node didn't send acknowledgment during cache creation: > [2019-09-27T15:00:17,727][WARN > ][excha

Re: Cluster went down after "Unable to await partitions release latch within timeout" WARN

2019-10-10 Thread Pavel Kovalenko
Hi Ibrahim, I see that one node didn't send acknowledgment during cache creation: [2019-09-27T15:00:17,727][WARN ][exchange-worker-#219][GridDhtPartitionsExchangeFuture] Unable to await partitions release latch within timeout: ServerLatch [permits=1,

Re: GridCachePartitionExchangeManager Null pointer exception

2019-10-07 Thread Pavel Kovalenko
Mahesh, An AssertionError occurs if you run the node with assertions enabled (JVM flag -ea). If assertions are disabled, it leads to the NullPointerException you have in the logs. Sat, 5 Oct 2019 at 16:47, Mahesh Renduchintala < mahesh.renduchint...@aline-consulting.com>: > Pavel, I don't have the logs

Re: GridCachePartitionExchangeManager Null pointer exception

2019-10-04 Thread Pavel Kovalenko
Mahesh, Do you have logs from the following thick client? TcpDiscoveryNode [id=5204d16d-e6fc-4cc3-a1d9-17edf59f961e, addrs=[0:0:0:0:0:0:0:1%lo, 127.0.0.1, 192.168.1.171], sockAddrs=[/0:0:0:0:0:0:0:1%lo:0, /127.0.0.1:0, /192.168.1.171:0], discPort=0, order=1146, intOrder=579,

Re: GridCachePartitionExchangeManager Null pointer exception

2019-10-03 Thread Pavel Kovalenko
Mahesh, According to your logs and the exception I see, the issue you mentioned is not related to your problem. The problem similar to IGNITE-10010 is https://issues.apache.org/jira/browse/IGNITE-9562 You have a thick client joining the server topology:

Re: GridCachePartitionExchangeManager Null pointer exception

2019-10-03 Thread Pavel Kovalenko
Hi Mahesh, Your problem is described here: https://issues.apache.org/jira/browse/IGNITE-12255 The section starts with "This solution showed the existing race between client node join and concurrent cache destroy." According to your logs, I see a concurrent client node join and stopping of caches

Re: Using Ignite as blob store?

2019-08-23 Thread Pavel Kovalenko
Denis, You can't set a page size greater than 16 KB due to our page memory limitations. Thu, 22 Aug 2019 at 22:34, Denis Magda: > How about setting page size to more KBs or MBs based on the average value? > That should work perfectly fine. > > - > Denis > > > On Thu, Aug 22, 2019 at 8:11 AM
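For reference, the page size is set on `DataStorageConfiguration`, and values above the 16 KB limit mentioned above are rejected at startup. A sketch in Ignite 2.x Spring XML configuration:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <!-- 16 KB is the maximum supported page size; larger values fail validation. -->
            <property name="pageSize" value="#{16 * 1024}"/>
        </bean>
    </property>
</bean>
```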

Re: Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition

2018-12-26 Thread Pavel Kovalenko
This sounds strange. There definitely should be a cause of such behaviour. Rebalancing happens only after a topology change (node join/leave, deactivation/activation). Could you please share logs from the node with the exception you mentioned in the message, and from the node with id

Re: Failed to send partition supply message to node: 5423e6b5-c9be-4eb8-8f68-e643357ec2b3 class org.apache.ignite.IgniteCheckedException: Could not find start pointer for partition

2018-12-26 Thread Pavel Kovalenko
Hello, It means that the node with id "5423e6b5-c9be-4eb8-8f68-e643357ec2b3" has outdated data (possibly due to a restart) and started to rebalance missed updates from a node with up-to-date data (where you have the exception) using the WAL. WAL rebalance is used when the number of entries in some partition
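The choice between full and WAL-based (historical) rebalance is driven by a threshold on a partition's entry count, tunable via a JVM system property. A sketch, assuming the `IGNITE_PDS_WAL_REBALANCE_THRESHOLD` property available in this era of Ignite 2.x; check your version's `IgniteSystemProperties` for the exact name and default:

```shell
# Partitions with at least this many entries become eligible for
# historical (WAL-based) rebalance instead of a full partition copy.
ignite.sh -J-DIGNITE_PDS_WAL_REBALANCE_THRESHOLD=500000
```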

Re: Full GC in client after cluster become unstable with "Unable to await partitions release latch within timeout"

2018-12-24 Thread Pavel Kovalenko
Hello, Could you please attach additional logs from the coordinator node? The id of that node appears in the "Unable to await partitions release latch" message. Also, it would be good to have logs from the client machine and from any other server node in the cluster. Mon, 24 Dec 2018 at 09:13,

Re: ZookeeperDiscovery block when communication error

2018-11-19 Thread Pavel Kovalenko
Hello Wangsan, It seems it's a known issue: https://issues.apache.org/jira/browse/IGNITE-9493 . Mon, 12 Nov 2018 at 18:06, wangsan: > I have a server node in zone A, then I start a client from zone B. Now > access > between A and B is controlled by a firewall. The ACL is: B can access A, but A can > not

Re: Long activation times with Ignite persistence enabled

2018-11-06 Thread Pavel Kovalenko
Hi Naveen and Andrey, We've recently done a major optimization, https://issues.apache.org/jira/browse/IGNITE-9420, that will speed up activation time in your case. Iteration over the WAL now happens only on node start-up, so it will not affect activation anymore. Partitions state restoring (which is

Re: Handling split brain with Zookeeper and persistence

2018-09-17 Thread Pavel Kovalenko
Hello Eugene, 1) The split-brain resolver takes into account only server nodes (not clients). There is no difference between in-memory only and persistence. 2) It's not necessary to immediately remove a node from the baseline topology after a split-brain. If you lost the backup factor for some partitions (All

Re: Partition map exchange in detail

2018-09-12 Thread Pavel Kovalenko
n one of the nodes. I would also expect the dead node to be > removed from the cluster, and no longer take part in PME. > > > > On Wed, Sep 12, 2018 at 11:25 AM Pavel Kovalenko > wrote: > >> Hi Eugene, >> >> Sorry, but I didn't catch the meaning of your quest

Re: a node fails and restarts in a cluster

2018-09-12 Thread Pavel Kovalenko
Hi Eugene, I've reproduced your problem and filed a ticket for it: https://issues.apache.org/jira/browse/IGNITE-9562 As a temporary workaround, I can suggest you delete the persistence data (cache.dat and partition files) related to that cache in the starting node's work directory, or don't destroy

Re: Partition map exchange in detail

2018-09-12 Thread Pavel Kovalenko
Hi Eugene, Sorry, but I didn't catch the meaning of your question about Zookeeper Discovery. Could you please re-phrase it? Wed, 12 Sep 2018 at 17:54, Ilya Lantukh: > Pavel K., can you please answer about Zookeeper discovery? > > On Wed, Sep 12, 2018 at 5:49 PM, eugene miretsky < >

Re: a node fails and restarts in a cluster

2018-09-07 Thread Pavel Kovalenko
Hello Evgeny, Could you please attach full logs from both nodes in your case #2? Make sure that quiet mode is disabled (-DIGNITE_QUIET=false) to have full info logs. Fri, 7 Sep 2018 at 17:41, es70: > I have a cluster of 2 Ignite (version 2.6) nodes with enabled persistence > (at > the time

Re: Proper config for IGFS eviction

2018-08-10 Thread Pavel Kovalenko
Hello Engrdean, You should enable persistence on your DataRegionConfiguration to make it possible to evict file metadata pages from memory to disk. 2018-08-09 19:49 GMT+03:00 engrdean : > I've been struggling to find a configuration that works successfully for > IGFS > with hadoop filesystem
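Enabling persistence on the data region backing the metadata cache could look like the following Spring XML sketch; the region name is hypothetical and should match the region referenced by your cache configuration:

```xml
<bean class="org.apache.ignite.configuration.IgniteConfiguration">
    <property name="dataStorageConfiguration">
        <bean class="org.apache.ignite.configuration.DataStorageConfiguration">
            <property name="dataRegionConfigurations">
                <list>
                    <bean class="org.apache.ignite.configuration.DataRegionConfiguration">
                        <property name="name" value="igfsMetaRegion"/> <!-- hypothetical name -->
                        <!-- With persistence enabled, cold metadata pages can be
                             evicted from memory and reloaded from disk on demand. -->
                        <property name="persistenceEnabled" value="true"/>
                    </bean>
                </list>
            </property>
        </bean>
    </property>
</bean>
```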

Re: Optimum persistent SQL storage and querying strategy

2018-08-08 Thread Pavel Kovalenko
Hello Jose, Did you consider MongoDB for your use case? 2018-08-08 10:13 GMT+03:00 joseheitor : > Hi Ignite Team, > > Any tips and recommendations...? > > Jose > > > > -- > Sent from: http://apache-ignite-users.70518.x6.nabble.com/ >

Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

2018-08-02 Thread Pavel Kovalenko
Hello Ray, I'm glad that your problem was resolved. I just want to add that at the beginning phase of PME we wait for all current client operations to finish; new operations are frozen until the PME ends. After a node finishes all ongoing client operations, it counts down the latch that you see in the logs, which

Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

2018-07-26 Thread Pavel Kovalenko
Hello Ray, Without explicit errors in the log, it's not so easy to guess what that was. Because I don't see any errors, it should be a recoverable failure (even if it took a long time). If you have such an option, could you please enable the DEBUG log level for

Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

2018-07-26 Thread Pavel Kovalenko
Hello Ray, It's hard to say whether the issue you mentioned is the cause of your problem. To determine it, it would be very good if you could get thread dumps on the next such network glitch, both from the server and client nodes (using jstack, e.g.). I'm not aware of Ignite Spark DataFrames implementation features,

Re: "Unable to await partitions release latch within timeout: ServerLatch" exception causing cluster freeze

2018-07-25 Thread Pavel Kovalenko
Hello Ray, According to your attached log, it seems that you have some network problems. Could you please also share logs from the nodes with temporary ids = [429edc2b-eb14-414f-a978-9bfe35443c8c, 6783732c-9a13-466f-800a-ad4c8d9be3bf]? The root cause should be on those nodes. 2018-07-25 13:03

Re: Ignite 2.5 uncaught BufferUnderflowException while reading WAL on startup

2018-06-15 Thread Pavel Kovalenko
David, No, this problem exists in older versions as well. Fri, 15 Jun 2018 at 17:54, David Harvey: > Is https://issues.apache.org/jira/browse/IGNITE-8780 a regression in 2.5 ? > > On Thu, Jun 14, 2018 at 7:03 AM, Pavel Kovalenko > wrote: > >> DocDVZ, >>

Re: Ignite 2.5 uncaught BufferUnderflowException while reading WAL on startup

2018-06-14 Thread Pavel Kovalenko
DocDVZ, Most probably you are facing the following issue: https://issues.apache.org/jira/browse/IGNITE-8780. You can try to remove the END file marker; in this case the node will be recovered using the WAL. Thu, 14 Jun 2018 at 12:00, DocDVZ : > As I see, the last checkpoint-end file that invoked the problem,

Re: Ignite 2.5 uncaught BufferUnderflowException while reading WAL on startup

2018-06-09 Thread Pavel Kovalenko
Hello DocDVZ, What is your hardware environment? Do you use an external or network storage device? 2018-06-09 15:14 GMT+03:00 DocDVZ : > Raw text blocks were discarded from the message: > Service parameters: > ignite.sh -J-Xmx6g -J-Xms6g -J-XX:+AlwaysPreTouch -J-XX:+UseG1GC >

Re: Apache Ignite application deploy without rebalancing

2018-04-25 Thread Pavel Kovalenko
Hello, Most probably no actual rebalancing is started and we fire the REBALANCE_STARTED event ahead of time. Could you please turn on the INFO log level for Ignite classes and check that after node shutdown the message "Skipping rebalancing" appears in the logs? 2018-04-25 7:55 GMT+03:00 moon-duck