persistent caches not rebalancing when new node is added
I have a 16-node Ignite (v2.10.0) cluster with persistence enabled and about 20 caches, all configured with cacheMode = PARTITIONED, backups = 1, rebalanceMode = ASYNC, and rebalanceDelay = -1 (so that rebalancing only happens when triggered manually). Automatic baseline adjustment is disabled. The cluster uses TcpDiscoveryVmIpFinder, and each of the 16 nodes has a list of all 16 IP addresses.

I want to expand the cluster with a 17th node and rebalance the data accordingly. On the new node, I update the config to include all 16 existing addresses plus its own, then start it up. Running ./control.sh --baseline on one of the original 16 nodes, I see all 16 nodes in the baseline, plus the new one in a separate section at the bottom (i.e. not yet part of the baseline). I run ./control.sh --baseline add, and it seems to work: there are now 17 nodes in the baseline topology, and the metrics logged every minute by each node report 17 servers. I see the same logs/info on the new node as well as the 16 original ones.

On the newly added node, I see logs like these after updating the baseline topology:

Local state for group durability has changed [name=MyCache1Name, enabled=false]
Local state for group durability has been logged to WAL [name=MyCache1Name, enabled=false]
...
Prepared rebalancing [grp=ignite-sys-cache, mode=SYNC, supplier=...]
...
Starting rebalance routine [grp=ignite-sys-cache, mode=SYNC, supplier=...]
...
Completed rebalancing [rebalanceId=42, grp=ignite-sys-cache, supplier=...]
Local state for group durability has changed [name=ignite-sys-cache, enabled=true]

I don't know what ignite-sys-cache is, but this all looks reasonable. The problem is that my actual caches are not rebalanced, and there is no data for them on the new node. I tried calling ignite.cache(cacheName).rebalance() on all of my caches, but that also appeared to have no effect, even after sitting overnight.
Is there something I'm missing with regard to how cluster expansion, rebalancing, and baseline topology work? I've tried for a couple of weeks to get this working with no success. The official docs don't say much on the subject beyond 'update the baseline topology and data rebalancing should occur based on your rebalanceMode and rebalanceDelay settings'.
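For reference, here is roughly what I'm doing to trigger the manual rebalance (a minimal sketch; it assumes it runs inside an already-started server node, and iterating all cache names is just how I'm doing it, not something the docs prescribe):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;

// Sketch: after adding the new node to the baseline, explicitly request
// rebalancing for every cache and block until each future completes.
// (With rebalanceDelay = -1, rebalancing should only start when requested.)
public class TriggerRebalance {
    public static void main(String[] args) {
        Ignite ignite = Ignition.ignite(); // attach to the already-started local node

        for (String name : ignite.cacheNames()) {
            // rebalance() returns a future that completes when this node's
            // rebalancing for the cache finishes.
            ignite.cache(name).rebalance().get();
        }
    }
}
```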
data rebalancing and partition map exchange with persistence
I'm using Ignite 2.9.1, a 5-node cluster with persistence enabled and partitioned caches with 1 backup. I'm a bit confused about the difference between data rebalancing and partition map exchange in this context.

1. Does data rebalancing occur when a node leaves or joins, or only when you manually change the baseline topology (assuming automatic baseline adjustment is disabled)? Again, this is on a cluster with persistence enabled.

2. Sometimes I look at the partition counts of a cache across all the nodes using Arrays.stream(ignite.affinity(cacheName).primaryPartitions(serverNode)) and I see 0 partitions on one or even two nodes for some of the caches. After a while it returns to a balanced state. What's going on here? Is this data rebalancing at work, or is this the result of the partition map exchange process determining that a node is/was down and thus switching to the backup partitions?

3. Is there a way to manually invoke the partition map exchange process? I figured it would happen on cluster restart, but even after restarting the cluster and seeing all baseline nodes connect, I still observe the partition imbalance. It often takes hours for this to resolve.

4. Sometimes I see 'partition lost' errors. If I am using persistence and all the baseline nodes are online and connected, is it safe to assume no data has been lost and just call cache.resetLostPartitions(myCaches)? Is there any way calling that method could lead to data loss with persistence enabled?

Thanks for your help!
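For context, this is roughly how I'm counting primary partitions per node (a sketch; the cache name is illustrative and it assumes it runs on a node that is already part of the cluster):

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cluster.ClusterNode;

// Sketch: print how many primary partitions each server node owns for a
// given cache, using the Affinity API. This is the check where I sometimes
// see 0 partitions on one or two nodes.
public class PartitionCounts {
    public static void main(String[] args) {
        Ignite ignite = Ignition.ignite();
        String cacheName = "myCache"; // illustrative name

        for (ClusterNode node : ignite.cluster().forServers().nodes()) {
            // primaryPartitions() returns the partition ids this node is
            // currently primary for, per the current affinity assignment.
            int primaries = ignite.affinity(cacheName).primaryPartitions(node).length;
            System.out.println(node.consistentId() + ": " + primaries + " primary partitions");
        }
    }
}
```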
ClassNotFoundException using peer class loading on cluster
I'm using peer class loading on a 5-node Ignite cluster with persistence enabled, Ignite version 2.8.1. I have a custom class that implements IgniteRunnable, and I launch it on the cluster. This works fine when deploying to an Ignite node running as a single-node cluster locally, but fails with a ClassNotFoundException (on my custom IgniteRunnable class) on the 5-node cluster.

I can see a reference to this class name in both the work-dir/marshaller and work-dir/binary_meta directories on each cluster node, so it seems like the class should be there. I have many other IgniteRunnables and distributed closures that all work fine -- this is the only one giving me trouble. I tried renaming the class, but that didn't help either.

After nearly three days, I'm running out of ideas (other than giving up and statically deploying the jar to each node, which I really want to avoid), and I'm looking for advice on how to troubleshoot an issue like this.

Thanks for your help,
Alan
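For reference, my peer-class-loading setup looks roughly like this (a programmatic sketch; the deployment mode shown is illustrative, not a claim about what causes the CNFE):

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.DeploymentMode;
import org.apache.ignite.configuration.IgniteConfiguration;

// Sketch of the peer-class-loading settings in play. Peer class loading
// must be enabled identically on every node; the deployment mode controls
// how class versions are tracked and when they are undeployed.
public class PeerClassLoadingConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        cfg.setPeerClassLoadingEnabled(true);
        // CONTINUOUS keeps deployed classes available after the master node
        // leaves; this particular choice is an assumption for the sketch.
        cfg.setDeploymentMode(DeploymentMode.CONTINUOUS);

        Ignition.start(cfg);
    }
}
```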
Re: IgniteCache.size() is hanging
Sorry, meant 2.7.6, not 2.7.3
Re: IgniteCache.size() is hanging
I wish I could -- this cluster is running on an isolated network and I can't get the logs or configs or anything down to the Internet.

But, I just figured out the problem -- I had set a very large value for failureDetectionTimeout (default is 10s). When I reverted that to the default, everything started working great.

This is interesting, because in 2.7.3, bumping up this setting didn't cause the same problem. I went back and forth between 2.7.3 and 2.8.1 a few times (using the same config with the large failureDetectionTimeout) and was able to replicate this -- it worked fine in 2.7.3 and broke in 2.8.1.

Hopefully this helps someone else out there,
Alan

On Thu, Sep 24, 2020 at 12:08 PM Andrei Aleksandrov wrote:
> Hi,
>
> Highly likely some of the nodes go offline and try to connect again.
> Probably you had some network issues. I think I will see this and other
> information in the logs. Can you provide them?
>
> BR,
> Andrei
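For anyone who hits this later, the fix amounted to not overriding the default timeout. A minimal programmatic sketch of the setting involved (the Spring XML equivalent sets the same property):

```java
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

// Sketch of the setting involved: I had raised failureDetectionTimeout far
// above the default; reverting it resolved the hang on 2.8.1.
public class FailureDetectionConfig {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // The default is 10_000 ms; the value I had set was much larger.
        cfg.setFailureDetectionTimeout(10_000);

        Ignition.start(cfg);
    }
}
```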
Re: IgniteCache.size() is hanging
The only log I see is from one of the server nodes, which is spewing at a very high rate:

[grid-nio-worker-tcp-comm-...][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/:47100, rmtAddr=:

Note that each time the log is printed, I see a different value there. Also note that I only see these logs when I try to run ignitevisorcmd's "cache" command. When I run the Java application that calls IgniteCache.size(), I don't see any such logs. But in both cases, the result is that the operation just hangs.

The cluster is active and I am able to insert data (albeit at a pretty slow rate), so it's not like things are completely non-functional. It's really confusing :\

On Thu, Sep 24, 2020 at 11:04 AM aealexsandrov wrote:
> Hi,
>
> Can you please provide the full server logs?
>
> BR,
> Andrei
IgniteCache.size() is hanging
I'm running a 5-node Ignite cluster, version 2.8.1, with persistence enabled and a small number of partitioned caches, ranging from a few thousand records to one cache with over 1 billion records. No SQL use.

When I run a Java client app and connect to the cluster (with clientMode = true), I connect fine and can quickly retrieve the names of all caches on the cluster. However, attempting to get the size of a cache via ignite.getOrCreateCache("existingCacheName").size() just hangs. This happens regardless of which cache I try to get the size of.

Sometimes I see a suspicious warning after a minute or so: WARNING: Node FAILED: TcpDiscoveryNode[...] -- it appears to reference my client node. I don't know why the node failed, what to do about it, or why it seems to happen so frequently. There are no relevant logs from any of the Ignite server nodes, nor from the Java app/client. There are also many times when I do not get a Node FAILED warning, but the size() operation still just hangs with no other information.

Thanks for your help!
Alan
Ignoring model fields
Is there a way (preferably annotation-based) to exclude certain fields in user-defined model classes from Ignite (cache, query, etc.), similar to how Jackson's @JsonIgnore annotation excludes a field from serialization/deserialization?

Thanks,
Alan
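In case it helps frame the question: with plain Java serialization, the transient keyword excludes a field, and my understanding (unverified for Ignite's binary marshaller) is that Ignite likewise skips transient fields. A minimal sketch using standard Java serialization (the class and field names are made up):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical model class: 'cachedBlob' is excluded from serialization
// via the 'transient' keyword.
class Profile implements Serializable {
    String name;
    transient byte[] cachedBlob; // skipped by Java serialization

    Profile(String name, byte[] blob) { this.name = name; this.cachedBlob = blob; }
}

public class TransientDemo {
    // Serialize and deserialize the object; transient fields come back
    // as their defaults (null for references).
    static Profile roundTrip(Profile p) throws Exception {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        new ObjectOutputStream(bos).writeObject(p);
        return (Profile) new ObjectInputStream(
            new ByteArrayInputStream(bos.toByteArray())).readObject();
    }

    public static void main(String[] args) throws Exception {
        Profile p = roundTrip(new Profile("alan", new byte[] {1, 2, 3}));
        System.out.println(p.name + " " + p.cachedBlob); // prints "alan null"
    }
}
```

This only demonstrates standard serialization behavior, though -- it would still be good to know whether there's an Ignite-specific annotation for the cache/query side.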
local deployment of web-console-standalone
I'm trying to get a local deployment of the Web Console working via Docker. I have the latest 2.7.0 version of the web-console-standalone Docker image, started with:

docker run -d -p 8080:80 --name web-console-standalone -e DEBUG=* apacheignite/web-console-standalone

The container starts up fine, and I see "Start listening on 127.0.0.1:3000" in the logs. When I try to access the web console via a browser at http://:8080/, it connects, but I get a "Loading..." indicator that never goes away -- the page is otherwise blank. There are no errors logged from the container, and no obvious problems in the Firefox dev tools/Network panel. This is on an enterprise network with no Internet access.

I've also noticed that if I go into the container, copy /opt/web-console/backend/agent_dists/ignite-web-agent-2.7.0.zip out onto my host box, and unzip it, there is no "default.properties" file, as the documentation seems to indicate there should be. I tried starting the web agent via the resulting ignite-web-agent.sh script, and it fails because the security tokens don't match. It seems those tokens should be available in the Web Console, but again, I can't load the /profile page to view them.

Thanks for the help!
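For completeness, these are the kinds of checks I've been running against the container (a diagnostic sketch; it assumes the container name used above, and that curl is available inside the image, which I haven't verified):

```shell
# Tail the container logs for startup errors beyond the "Start listening" line.
docker logs web-console-standalone | tail -n 50

# Check what the container itself serves on port 80, bypassing the browser
# and any proxy on the enterprise network.
docker exec web-console-standalone curl -s -o /dev/null -w "%{http_code}\n" http://localhost:80/
```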