persistent caches not rebalancing when new node is added

2021-06-23 Thread Alan Ward
I have a 16-node Ignite (v2.10.0) cluster with persistence enabled and
about 20 caches, all of which are configured with cacheMode = PARTITIONED,
backups = 1, rebalanceMode = ASYNC, and rebalanceDelay = -1 (so that
rebalancing only happens when triggered manually). The automatic baseline
adjustment feature is disabled. The cluster uses TcpDiscoveryVmIpFinder, and
each of the 16 nodes has a list of all 16 IP addresses.
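
For reference, the relevant pieces of the node configuration look roughly
like this (a simplified Java sketch that mirrors my actual config rather than
copying it; the cache name and addresses are placeholders):

import java.util.Arrays;

import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.CacheRebalanceMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.DataRegionConfiguration;
import org.apache.ignite.configuration.DataStorageConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi;
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder;

public class NodeConfig {
    static IgniteConfiguration config() {
        CacheConfiguration<Object, Object> cacheCfg = new CacheConfiguration<>("MyCache1Name")
            .setCacheMode(CacheMode.PARTITIONED)
            .setBackups(1)
            .setRebalanceMode(CacheRebalanceMode.ASYNC)
            .setRebalanceDelay(-1); // rebalance only when triggered manually

        TcpDiscoveryVmIpFinder ipFinder = new TcpDiscoveryVmIpFinder();
        // Each node lists the discovery addresses of all 16 servers.
        ipFinder.setAddresses(Arrays.asList(
            "10.0.0.1:47500..47509", /* ... */ "10.0.0.16:47500..47509"));

        return new IgniteConfiguration()
            .setDiscoverySpi(new TcpDiscoverySpi().setIpFinder(ipFinder))
            .setDataStorageConfiguration(new DataStorageConfiguration()
                .setDefaultDataRegionConfiguration(
                    new DataRegionConfiguration().setPersistenceEnabled(true)))
            .setCacheConfiguration(cacheCfg);
    }
}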

I want to expand the cluster by adding a 17th node and rebalance the data
accordingly. On the new node, I update the config to include all 16 existing
addresses plus its own, then start it up. Using ./control.sh --baseline on one
of the original 16 nodes, I see all 16 nodes in the baseline, plus the new one
in a separate section at the bottom (i.e. not yet part of the baseline). I
then run ./control.sh --baseline add with the new node's consistent ID, and it
seems to work: I now have 17 nodes in the baseline topology, and the metrics
that are logged every minute on each node indicate that there are now 17
servers. I see the same logs/info on the new node as well as the 16 original
ones.
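
In case it's relevant, my understanding is that the control.sh call is
roughly equivalent to doing this from code (just a sketch of what I believe
the Java baseline API does, not something I've verified):

// From code running on any node of the active cluster:
static void extendBaselineToCurrentTopology(org.apache.ignite.Ignite ignite) {
    // Set the baseline to the current topology version, i.e. all 17 server nodes.
    ignite.cluster().setBaselineTopology(ignite.cluster().topologyVersion());
}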

On the newly added node, I see logs like these after updating the baseline
topology:

Local state for group durability has changed [name=MyCache1Name,
enabled=false]
Local state for group durability has been logged to WAL [name=MyCache1Name,
enabled=false]
...
Prepared rebalancing [grp=ignite-sys-cache, mode=SYNC, supplier=...]
...
Starting rebalance routine [grp=ignite-sys-cache, mode=SYNC, supplier=...]
...
Completed rebalancing [rebalanceId=42, grp=ignite-sys-cache, supplier=...]
Local state for group durability has changed [name=ignite-sys-cache,
enabled=true]

I don't know what ignite-sys-cache is, but this all seems fine. However, my
actual caches are not rebalanced, and I have no data for them on this new
node. I tried calling ignite.cache(cacheName).rebalance() on all of my caches,
but that also appeared to have no effect, even after sitting overnight.
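
For completeness, this is roughly how I triggered it (a sketch; I'm assuming
the future-returning rebalance() signature here):

static void rebalanceAllCaches(org.apache.ignite.Ignite ignite) {
    for (String cacheName : ignite.cacheNames()) {
        // Request rebalancing of the cache and block until the returned future completes.
        ignite.cache(cacheName).rebalance().get();
    }
}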

Is there something I'm missing with regard to how cluster expansion,
rebalancing, and baseline topology work? I've tried for a couple of weeks to
get this working with no success. The official docs don't say much on the
subject other than 'update the baseline topology and data rebalancing should
occur based on your rebalanceMode and rebalanceDelay settings'.


data rebalancing and partition map exchange with persistence

2021-01-29 Thread Alan Ward
I'm using Ignite 2.9.1 on a 5-node cluster with persistence enabled and
partitioned caches with 1 backup.

I'm a bit confused about the difference between data rebalancing and
partition map exchange in this context.

1. Does data rebalancing occur when a node leaves or joins, or only when
you manually change the baseline topology (assuming automatic baseline
adjustment is disabled)? Again, this is on a cluster with persistence
enabled.

2. Sometimes I look at the primary partition counts of a cache across all the
nodes using Arrays.stream(ignite.affinity(cacheName).primaryPartitions(serverNode))
(see the sketch after question 4), and I see 0 partitions on one or even two
nodes for some of the caches. After a while it returns to a balanced state.
What's going on here? Is this data rebalancing at work, or is this the result
of the partition map exchange process determining that one node is/was down
and thus switching to the backup partitions?

3. Is there a way to manually invoke the partition map exchange process? I
figured it would happen on cluster restart, but even after restarting the
cluster and seeing all baseline nodes connect I still observe the partition
imbalance. It often takes hours for this to resolve.

4. Sometimes I see 'partition lost' errors. If I am using persistence and all
the baseline nodes are online and connected, is it safe to assume no data has
been lost and just call ignite.resetLostPartitions(myCacheNames)? Is there any
way calling that method could lead to data loss with persistence enabled?
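
For context, here is roughly the check I use for question 2 and the reset
call from question 4 (a sketch; cache names are placeholders):

import java.util.Collections;

import org.apache.ignite.Ignite;
import org.apache.ignite.cluster.ClusterNode;

public class PartitionChecks {
    // Question 2: how many primary partitions of a cache each server node owns.
    static void printPrimaryPartitionCounts(Ignite ignite, String cacheName) {
        for (ClusterNode node : ignite.cluster().forServers().nodes()) {
            int primaries = ignite.affinity(cacheName).primaryPartitions(node).length;
            System.out.println(node.consistentId() + " -> " + primaries + " primary partitions");
        }
    }

    // Question 4: reset lost partitions once all baseline nodes are back online.
    static void resetLost(Ignite ignite, String cacheName) {
        ignite.resetLostPartitions(Collections.singleton(cacheName));
    }
}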

Thanks for your help!


ClassNotFoundException using peer class loading on cluster

2020-10-08 Thread Alan Ward
I'm using peer class loading on a 5-node Ignite cluster with persistence
enabled, Ignite version 2.8.1. I have a custom class that implements
IgniteRunnable and I launch it on the cluster. This works fine when deploying
to an Ignite node running as a single-node cluster locally, but it fails with
a ClassNotFoundException (on my custom IgniteRunnable class) on the 5-node
cluster. I can see a reference to this class name in both the
work-dir/marshaller and work-dir/binary_meta directories on each cluster
node, so it seems like the class should be there.
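
The setup is essentially this (a trimmed-down sketch; MyTask stands in for my
actual class):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.lang.IgniteRunnable;

public class PeerClassLoadingExample {
    // Placeholder for my actual IgniteRunnable implementation.
    static class MyTask implements IgniteRunnable {
        @Override public void run() {
            System.out.println("running on " + Ignition.localIgnite().cluster().localNode().id());
        }
    }

    public static void main(String[] args) {
        // Peer class loading is enabled on every node (servers and this client).
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setClientMode(true)
            .setPeerClassLoadingEnabled(true);

        try (Ignite ignite = Ignition.start(cfg)) {
            ignite.compute().run(new MyTask()); // ClassNotFoundException on the 5-node cluster
        }
    }
}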

I have many other IgniteRunnables and distributed closures that all work
fine -- this is the only one giving me trouble. I tried renaming the class,
but that didn't help either.

After nearly three days, I'm running out of ideas (other than giving up and
statically deploying the jar to each node, which I really want to avoid),
and I'm looking for advice on how to troubleshoot an issue like this.

Thanks for your help,

Alan


Re: IgniteCache.size() is hanging

2020-09-29 Thread Alan Ward
Sorry, meant 2.7.6, not 2.7.3


Re: IgniteCache.size() is hanging

2020-09-29 Thread Alan Ward
I wish I could -- this cluster is running on an isolated network and I
can't get the logs or configs or anything down to the Internet.

But, I just figured out the problem -- I had set a very large value for
failureDetectionTimeout (default is 10s). When I reverted that to the
default, everything started working great.

This is interesting, because in 2.7.3, bumping up this setting didn't cause
the same problem. I went back and forth between 2.7.3 and 2.8.1 a few times
(using the same config w/ the large failureDetectionTimeout) and was able
to replicate this -- worked fine in 2.7.3, and broke in 2.8.1.
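
For anyone who hits this later, the setting in question is just this (a
sketch; the value shown is illustrative, not my exact config):

IgniteConfiguration cfg = new IgniteConfiguration();

// Default is 10_000 ms; I had raised it to a very large value,
// which is what made size() hang for me on 2.8.1.
cfg.setFailureDetectionTimeout(10 * 60 * 1000L); // e.g. 10 minutes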

Hopefully this helps someone else out there,

Alan



On Thu, Sep 24, 2020 at 12:08 PM Andrei Aleksandrov 
wrote:

> Hi,
>
> Highly likely some of the nodes go offline and try to connect again.
> Probably you had some network issues. I think I will see this and other
> information in the logs. Can you provide them?
>
> BR,
> Andrei


Re: IgniteCache.size() is hanging

2020-09-24 Thread Alan Ward
The only log I see is from one of the server nodes, which is spewing at a
very high rate:

[grid-nio-worker-tcp-comm-...][TcpCommunicationSpi] Accepted incoming
communication connection [locAddr=/:47100, rmtAddr=:

Note that each time the log is printed, I see a different value for rmtAddr.

Also note that I only see these logs when I try to run ignitevisorcmd's
"cache" command. When I run the Java application that calls
IgniteCache.size(), I don't see any such logs. But in both cases, the
result is that the operation just hangs.

The cluster is active and I am able to insert data (albeit at a pretty slow
rate), so it's not like things are completely non-functional. It's really
confusing :\

On Thu, Sep 24, 2020 at 11:04 AM aealexsandrov 
wrote:

> Hi,
>
> Can you please provide the full server logs?
>
> BR,
> Andrei
>
>
>
> --
> Sent from: http://apache-ignite-users.70518.x6.nabble.com/
>


IgniteCache.size() is hanging

2020-09-24 Thread Alan Ward
I'm running a 5-node Ignite cluster, version 2.8.1, with persistence enabled
and a small number of partitioned caches, ranging from a few thousand records
up to one cache with over 1 billion records. No SQL use.

When I run a Java client app and connect to the cluster (with clientMode =
true), I connect fine and can retrieve the names of all caches on the
cluster quickly. However, attempting to get the size of a cache via
ignite.getOrCreateCache("existingCacheName").size() just hangs. This
happens
regardless of which cache I try to get the size of.
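
The client code is essentially this (a trimmed-down sketch; the cache name is
a placeholder, and the discovery config is omitted):

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class CacheSizeCheck {
    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration().setClientMode(true);
        // Discovery SPI config omitted -- same addresses as the server nodes.

        try (Ignite ignite = Ignition.start(cfg)) {
            System.out.println(ignite.cacheNames()); // returns quickly
            System.out.println(ignite.getOrCreateCache("existingCacheName").size()); // hangs here
        }
    }
}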

Sometimes I see a suspicious warning after a minute or so: WARNING: Node
FAILED: TcpDiscoveryNode[...] -- it appears to be referencing my client node.
I don't know why the node failed, what to do about it, or why it seems to
happen so frequently. There are no relevant logs coming from any of the
Ignite server nodes, nor from the Java app/client.

There are also many times when I do not get a Node FAILED warning, but
still the size() operation just hangs with no other information.

Thanks for your help!

Alan


Ignoring model fields

2019-02-19 Thread Alan Ward
Is there a way (preferably annotation-based) to exclude certain fields of
user-defined model classes from Ignite (cache, query, etc.), similar to how
Jackson's @JsonIgnore annotation excludes a field from
serialization/deserialization?
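
Something like this is what I'm hoping for (the @IgniteIgnore annotation is
hypothetical -- it's only meant to illustrate the kind of API I'm after):

public class Person {
    private String name; // stored in the cache and queryable as usual

    @IgniteIgnore // hypothetical annotation, analogous to Jackson's @JsonIgnore
    private Object derivedUiState; // should not be serialized into the cache or exposed to queries
}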

Thanks,
Alan


local deployment of web-console-standalone

2019-02-12 Thread Alan Ward
I'm trying to get a local deployment of the web console working via docker.

I have the latest 2.7.0 version of the web-console-standalone Docker image,
started with "docker run -d -p 8080:80 --name web-console-standalone -e
DEBUG=* apacheignite/web-console-standalone".

The container starts up fine, and I see "Start listening on 127.0.0.1:3000"
in the logs. When I try to access the web console via a browser at
http://:8080/, it connects, but I get a "Loading..." indicator that never
goes away -- the page is otherwise blank. There are no errors being logged
from the container, and no obvious problems in the Firefox dev tools/Network
tab.

This is on an enterprise network with no Internet access.

I've also noticed that if I go into the container and copy
/opt/web-console/backend/agent_dists/ignite-web-agent-2.7.0.zip out onto my
host box and unzip it, there is no "default.properties" file as the
documentation seems to indicate there should be. I tried starting up the
web-agent via the resulting ignite-web-agent.sh script, and it fails due to
the security tokens not matching. It seems those tokens should be available
in the web-console, but again, I can't load the /profile page to view them.

Thanks for the help!