Near Cache versus Continuous Query

2021-02-19 Thread tschauenberg
Hi,

I have a use case where I want a full copy of a replicated cache on a
subset of my thick clients.  In this example, I have an ETL thick client
that creates and updates a replicated cache in my server grid.  I then have
a series of webserver thick clients on which I always want a fully
up-to-date copy of that cache.

NearCache Attempt:
=
I tried using a NearCache on the webserver thick clients but this had two
undesirable problems:
* it created a near cache on each server node, which I have no use for since
this is already a replicated cache
* the near cache never received new entries from the replicated cache; it
was only updated for entries it had already stored.

Is there a way I can resolve either of these two undesirable problems with
the NearCache in my situation?
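
Is explicitly creating the near cache from the client side, e.g. via
Ignite.getOrCreateNearCache(), the intended way around the first problem?  A
rough sketch of what I mean (simplified Kotlin; the cache name and types are
placeholders, not my real code):

Client-side near cache sketch:
---

import org.apache.ignite.Ignite
import org.apache.ignite.IgniteCache
import org.apache.ignite.configuration.NearCacheConfiguration

// Create the near cache only on this (webserver) thick client, without
// adding anything near-cache related to the shared cache configuration.
fun clientNearCache(ignite: Ignite): IgniteCache<String, String> {
    val nearCfg = NearCacheConfiguration<String, String>()
    return ignite.getOrCreateNearCache("myReplicatedCache", nearCfg)
}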


Continuous Query Attempt:
=
This led me to instead consider Continuous Queries (CQ), where each
webserver would maintain its own Map of the server cache data: on startup it
uses the CQ initial query to get the current server state, and from then on
it uses the CQ local listener to keep that Map up to date.

Trying to get the CQ working, I followed the examples in
https://ignite.apache.org/docs/latest/configuring-caches/near-cache
but I can only see the local listener updates if I run the query in my own
thread that I never let finish.

What am I doing wrong in the code below?

Continuous Query code:
---

// Create new continuous query.
val qry = ContinuousQuery()

// have it return all data in its initial query
// Setting an optional initial query.
qry.setInitialQuery(ScanQuery())

// don't set a remote filter as we want all data returned
// qry.setRemoteFilterFactory()

// Callback that is called locally when update notifications are received.
qry.setLocalListener { events ->
    println("Update notifications starting...[thread=${Thread.currentThread()}]")
    for (e in events) {
        println("Listener event: [thread=${Thread.currentThread()}, key=${e.key}, val=${e.value}]")
    }
    println("Update notifications finished [thread=${Thread.currentThread()}]")
}

val executor = Executors.newSingleThreadExecutor()
executor.submit {
    myCache.query(qry).use { cur ->

        // Iterating over initial query results
        println("Initial query cursor starting...[thread=${Thread.currentThread()}]")
        for (e in cur) {
            println("Cursor value: [thread=${Thread.currentThread()}, key=${e.key}, val=${e.value}]")
        }
        println("Initial query cursor finished [thread=${Thread.currentThread()}]")

        println("Starting holding continuous query cursor open so we can get callback data... [thread=${Thread.currentThread()}]")
        val shouldBeRunning = true
        while (shouldBeRunning) {
            // hold this thread open forever so the local listener callback can keep processing events
            try {
                Thread.sleep(5000)
            } catch (e: InterruptedException) {
                println("Continuous query sleep interrupted [thread=${Thread.currentThread()}]")
                throw e
            }
        }
        println("Stopping holding continuous query cursor open because we got shutdown signal [thread=${Thread.currentThread()}]")
    }
}
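
In case it clarifies what I am after, this is the pattern I expected to be
able to use (simplified Kotlin; the key/value types and names are
placeholders, not my real code): keep the QueryCursor in a field so the
local listener keeps firing without parking a thread, and close the cursor
on shutdown.

Expected pattern sketch:
---

import org.apache.ignite.IgniteCache
import org.apache.ignite.cache.query.ContinuousQuery
import org.apache.ignite.cache.query.QueryCursor
import org.apache.ignite.cache.query.ScanQuery
import java.util.concurrent.ConcurrentHashMap
import javax.cache.Cache

class LocalCacheCopy(private val cache: IgniteCache<String, String>) {

    // Local copy of the replicated cache, maintained on the webserver client.
    val localCopy = ConcurrentHashMap<String, String>()

    // Keeping a reference to the cursor is what keeps the continuous query
    // (and therefore the local listener) registered - no sleeping thread.
    private var cursor: QueryCursor<Cache.Entry<String, String>>? = null

    fun start() {
        val qry = ContinuousQuery<String, String>()
        qry.setInitialQuery(ScanQuery<String, String>())
        qry.setLocalListener { events ->
            for (e in events) localCopy[e.key] = e.value
        }

        val cur = cache.query(qry)
        // Drain the initial query results into the local map.
        for (e in cur) localCopy[e.key] = e.value
        cursor = cur
    }

    // Closing the cursor deregisters the continuous query on shutdown.
    fun stop() = cursor?.close()
}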





--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Ignite Startup Warnings

2021-02-03 Thread tschauenberg
Paolo,

> I could get rid of most of them by changing my configuration (e.g. setting
> a default non-logging CheckpointSpi or non-logging CollisionSpi
> implementation)

Would you have the snippet for this config?  If they are your custom
non-logging classes are you comfortable sharing them here too?

> by adding VM parameters (avoid Java9 module access warnings)

Could you share what VM parameters those are?  

I've been using
java --add-exports=java.base/jdk.internal.misc=ALL-UNNAMED
--add-exports=java.base/sun.nio.ch=ALL-UNNAMED
--add-exports=java.management/com.sun.jmx.mbeanserver=ALL-UNNAMED
--add-exports=jdk.internal.jvmstat/sun.jvmstat.monitor=ALL-UNNAMED
--add-exports=java.base/sun.reflect.generics.reflectiveObjects=ALL-UNNAMED
--illegal-access=permit -Djdk.tls.client.protocols=TLSv1.2

And I still see the warnings with Ignite 2.8.1:
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by
org.apache.ignite.internal.util.GridUnsafe$2
(file:.../ignite-core-2.8.1.jar) to field java.nio.Buffer.address
WARNING: Please consider reporting this to the maintainers of
org.apache.ignite.internal.util.GridUnsafe$2
WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations
WARNING: All illegal access operations will be denied in a future release
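
In case it matters, the only additional flag I have been considering
(untested on my side, and purely a guess based on the package named in the
GridUnsafe warning) is opening java.nio itself, on top of the exports above:

--add-opens=java.base/java.nio=ALL-UNNAMED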

Thanks in advance.




--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


RE: incorrect partition map exchange behaviour

2021-01-13 Thread tschauenberg
Sorry about mixing the terminology.  My post was meant to be about the PME
and the primary keys.

So the summary of my post, and what it was trying to show, is that the PME
was only happening on cluster node leaves (server or visor) but not on
cluster node joins (at least with previously joined nodes - I haven't tested
joining a brand new node for the first time, such as expanding the cluster
from 3 nodes to 4 nodes).

The PME doc suggests the PME should also happen on joins, but the logs and
visor/stats show that it is only happening on the leaves.

So what I am trying to identify is:
* is this a known bug and, if so, which versions is it fixed in?
* what is the impact of a database state where one node has no designated
primaries?
** This probably effectively reduces the get/put performance to n-1 nodes?
** Also, for compute tasks that operate on local data, such as those using
SqlFieldsQuery.setLocal(true) (see the sketch below), will the node with no
primaries do nothing?
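
For concreteness, this is the kind of node-local work I mean (a simplified
Kotlin sketch; the cache/table name "Devices" is a placeholder, not our real
job):

Local query sketch:
---

import org.apache.ignite.Ignite
import org.apache.ignite.cache.query.SqlFieldsQuery

// Runs the SQL only against data hosted on this node.  If the node owns no
// primary partitions, I assume a local query like this would just see zero
// rows.
fun countLocalDevices(ignite: Ignite): Long {
    val cache = ignite.cache<Any, Any>("Devices")
    val qry = SqlFieldsQuery("select count(*) from Devices")
    qry.setLocal(true)
    return cache.query(qry).all[0][0] as Long
}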



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: incorrect partition map exchange behaviour

2021-01-13 Thread tschauenberg
Haven't tested on 2.9.1 as we don't have that database provisioned and
sadly won't for a while.  When we do, though, I will update.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: incorrect partition map exchange behaviour

2021-01-08 Thread tschauenberg
Here's my attempt to demonstrate and also provide logs

Standup 3 node cluster and load with data


Using a thick client, 250k devices are loaded into the device cache.  The
thick client then leaves.  There's one other thick client connected the
whole time for serving requests; I think that's irrelevant for the test, but
I want to point it out in case someone notices there's still a client
connected.

Show topology from logs of the client leaving:


[2021-01-08T23:08:05.012Z][INFO][disco-event-worker-#40][GridDiscoveryManager]
Node left topology: TcpDiscoveryNode
[id=611e30ee-b7c6-4ead-a746-f609b206cfb4,
consistentId=611e30ee-b7c6-4ead-a746-f609b206cfb4, addrs=ArrayList
[127.0.0.1, 172.17.0.3], sockAddrs=HashSet [/127.0.0.1:0, /172.17.0.3:0],
discPort=0, order=7, intOrder=6, lastExchangeTime=1610146373751, loc=false,
ver=2.8.1#20200521-sha1:86422096, isClient=true]
[2021-01-08T23:08:05.013Z][INFO][disco-event-worker-#40][GridDiscoveryManager]
Topology snapshot [ver=8, locNode=75e4ddea, servers=3, clients=1,
state=ACTIVE, CPUs=7, offheap=3.0GB, heap=3.1GB]

Start visor on one of the nodes


Show topology from logs


[2021-01-08T23:30:33.461Z][INFO][tcp-disco-msg-worker-[4ea8efe1
10.12.3.76:47500]-#2][TcpDiscoverySpi] New next node
[newNext=TcpDiscoveryNode [id=1cca94e3-f15f-4a8b-9f65-d9b9055a5fa7,
consistentId=10.12.2.110:47501, addrs=ArrayList [10.12.2.110],
sockAddrs=HashSet [/10.12.2.110:47501], discPort=47501, order=0, intOrder=7,
lastExchangeTime=1610148633458, loc=false, ver=2.8.1#20200521-sha1:86422096,
isClient=false]]
[2021-01-08T23:30:34.045Z][INFO][sys-#1011][GridDhtPartitionsExchangeFuture]
Completed partition exchange
[localNode=75e4ddea-1927-4e93-82e9-fdfbb7b58d1c,
exchange=GridDhtPartitionsExchangeFuture [topVer=AffinityTopologyVersion
[topVer=9, minorTopVer=0], evt=NODE_JOINED, evtNode=TcpDiscoveryNode
[id=1cca94e3-f15f-4a8b-9f65-d9b9055a5fa7, consistentId=10.12.2.110:47501,
addrs=ArrayList [10.12.2.110], sockAddrs=HashSet [/10.12.2.110:47501],
discPort=47501, order=9, intOrder=7, lastExchangeTime=1610148633458,
loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=false], done=true,
newCrdFut=null], topVer=AffinityTopologyVersion [topVer=9, minorTopVer=0]]

Show data balanced in visor


+---+-+---+-+---+---+---+---++
| Devices(@c2)  | PARTITIONED | 3 | 25 (0 / 25)
| min: 80315 (0 / 80315)| min: 0| min: 0| min: 0|
min: 25|
|   | |   |
| avg: 8.33 (0.00 / 8.33)   | avg: 0.00 | avg: 0.00 | avg: 0.00 |
avg: 25.00 |
|   | |   |
| max: 86968 (0 / 86968)| max: 0| max: 0| max: 0|
max: 25|
+---+-+---+-+---+---+---+---++

At this point the data is all relatively balanced and the topology increased
when visor connected.

Stop ignite on one node


Show topology and PME from logs (from a different ignite node as the ignite
process was stopped)


[2021-01-08T23:35:39.333Z][INFO][disco-event-worker-#40][GridDiscoveryManager]
Node left topology: TcpDiscoveryNode
[id=75e4ddea-1927-4e93-82e9-fdfbb7b58d1c,
consistentId=3a4a497f-5a89-4f2c-8531-b2b05f2ede22, addrs=ArrayList
[10.12.2.110], sockAddrs=HashSet [/10.12.2.110:47500], discPort=47500,
order=3, intOrder=3, lastExchangeTime=1610139164908, loc=false,
ver=2.8.1#20200521-sha1:86422096, isClient=false]
[2021-01-08T23:35:39.333Z][INFO][disco-event-worker-#40][GridDiscoveryManager]
Topology snapshot [ver=10, locNode=4ea8efe1, servers=2, clients=1,
state=ACTIVE, CPUs=5, offheap=2.0GB, heap=2.1GB]
[2021-01-08T23:35:39.333Z][INFO][disco-event-worker-#40][GridDiscoveryManager]  
^-- Baseline [id=0, size=3, online=2, offline=1]
[2021-01-08T23:35:39.335Z][INFO][exchange-worker-#41][time] Started exchange
init [topVer=AffinityTopologyVersion [topVer=10, minorTopVer=0], crd=true,
evt=NODE_LEFT, evtNode=75e4ddea-1927-4e93-82e9-fdfbb7b58d1c, customEvt=null,
allowMerge=false, exchangeFreeSwitch=true]
[2021-01-08T23:35:39.338Z][INFO][sys-#1031][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal [cache=Households,
expectedPrimary=512.00, actualPrimary=548, expectedBackups=1024.00,
actualBackups=476, warningThreshold=50.00%]
[2021-01-08T23:35:39.340Z][INFO][sys-#1032][GridAffinityAssignmentCache]
Local node affinity assignment distribution is not ideal [cache=Devices,
expectedPrimary=512.00, actualPrimary=548, expectedBackups=1024.00,
actualBackups=476, warningThreshold=50.00%]
[2021-01-08T23:35:39.354Z][INFO][exchange-worker-#41][GridDhtPartitionsExchangeFuture]
Finished waiting for partition release future
[topVer=AffinityT

incorrect partition map exchange behaviour

2021-01-07 Thread tschauenberg
Hi,

We have a cluster of Ignite 2.8.1 server nodes and have recently started
looking at the individual cache metrics for primary keys
org.apache.ignite.internal.processors.cache.CacheLocalMetricsMXBeanImpl.OffHeapPrimaryEntriesCount

In our configuration we have a replicated cache with 2 backups.  Our cluster
has 3 nodes in it so the primaries should be spread equally on the 3 nodes
and each node has backups from the other two nodes.  All these server nodes
are in the baseline.  Additionally we have some thick clients connected but
I don't think they are relevant to the discussion.

Whenever we do a rolling restart one node at a time, at the end after the
last node is restarted it always owns zero primaries and owns solely
backups.  The two nodes restarted earlier during the rolling restart own all
the primaries.

When our cluster is in this scenario, if we start and stop visor, when visor
leaves the cluster it triggers a PME where all keys get balanced on all
server nodes.  Looking at the visor cache stats between the start and stop
we can see a min of 0 keys on the nodes for our cache so visor and the jmx
metrics line up on that front.  After stopping visor, the jmx metrics show
the evenly distributed primaries and then starting visor a second time we
can confirm that again the min, average, max node keys are all evenly
distributed.
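
For what it's worth, I assume the same distribution could also be
cross-checked programmatically through the affinity API, along these lines
(rough Kotlin sketch; "Devices" is just one of our cache names):

Primary distribution check sketch:
---

import org.apache.ignite.Ignite

// Print how many primary and backup partitions each server node owns for a
// given cache, to compare against the visor and JMX numbers.
fun printPartitionDistribution(ignite: Ignite, cacheName: String = "Devices") {
    val affinity = ignite.affinity<Any>(cacheName)
    for (node in ignite.cluster().forServers().nodes()) {
        val primaries = affinity.primaryPartitions(node).size
        val backups = affinity.backupPartitions(node).size
        println("node=${node.consistentId()} primaries=$primaries backups=$backups")
    }
}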

Every join and leave during the rolling restart and during visor start/stop
is reflected as a topology increment and node leave/join events in the
logs.

According to
https://cwiki.apache.org/confluence/display/IGNITE/%2528Partition+Map%2529+Exchange+-+under+the+hood
each leave and join should trigger the PME but we only see the keys changing
on the leaves.

Additionally, we tried waiting longer between the stop and start part of the
rolling restart to see if that had any effect.  We ensured we waited long
enough for a PME to do any moving but waiting longer didn't have any effect. 
The stop always has the PME move the keys off that node and the start never
sees the PME move any primaries back.

Why are we only seeing the PME change keys when nodes (server or visor) stop
and never when they join?



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Ignite timeouts and trouble interpreting the logs

2020-10-30 Thread tschauenberg
First some background.  Ignite 2.8.1 with a 3 node cluster, two webserver
client nodes, and one batch processing client node that comes and goes.

The two webserver thick client nodes and the one batch processing thick
client node have the following configuration values:
* IgniteConfiguration.setNetworkTimeout(60000)
* IgniteConfiguration.setFailureDetectionTimeout(120000)
* TcpDiscoverySpi.setJoinTimeout(60000)
* TcpCommunicationSpi.setIdleConnectionTimeout(Long.MAX_VALUE)

The server nodes do not have any timeouts set and are currently using all
defaults.  My understanding is that means they are using:
* failureDetectionTimeout 10000
* clientFailureDetectionTimeout 30000
 
Every so often the batch processing client node fails to connect to the
cluster.  We try to connect the batch processing client node to a single
node in the cluster using:
TcpDiscoverySpi.setIpFinder(TcpDiscoveryVmIpFinder().setAddresses(single
node ip))
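
In code form, that client configuration is roughly the following (Kotlin
sketch; the discovery address is a placeholder for the single server node
IP, and the timeout values are the ones listed above):

Client configuration sketch:
---

import org.apache.ignite.configuration.IgniteConfiguration
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi
import org.apache.ignite.spi.discovery.tcp.TcpDiscoverySpi
import org.apache.ignite.spi.discovery.tcp.ipfinder.vm.TcpDiscoveryVmIpFinder

fun clientConfig(): IgniteConfiguration {
    val cfg = IgniteConfiguration()
    cfg.setClientMode(true)
    cfg.setNetworkTimeout(60000L)
    cfg.setFailureDetectionTimeout(120000L)

    // Discovery points at a single server node (placeholder address).
    val discovery = TcpDiscoverySpi()
    discovery.setJoinTimeout(60000L)
    discovery.setIpFinder(TcpDiscoveryVmIpFinder().setAddresses(listOf("10.1.10.xxx:47500")))

    val communication = TcpCommunicationSpi()
    communication.setIdleConnectionTimeout(Long.MAX_VALUE)

    cfg.setDiscoverySpi(discovery)
    cfg.setCommunicationSpi(communication)
    return cfg
}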

I see the following stream of logs on the server node the client connects
to, and I am hoping you can shed some light on which timeout values I have
set incorrectly and what values I need to set instead.

In these logs I have obfuscated the client IP to 10.1.2.xxx and the server
IP as 10.1.10.xxx


On the server node that the client tries to connect to I see the following
sequence of messages:

[20:21:28,092][INFO][exchange-worker-#42][GridCachePartitionExchangeManager]
Skipping rebalancing (nothing scheduled) [top=AffinityTopologyVersion
[topVer=4146, minorTopVer=0], force=false, evt=NODE_JOINED,
node=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3]

So the client joined the cluster almost exactly at the time it tried to
join, which seems good so far.

Then I see
[20:21:54,726][INFO][db-checkpoint-thread-#56][GridCacheDatabaseSharedManager]
Skipping checkpoint (no pages were modified) [checkpointBeforeLockTime=6ms,
checkpointLockWait=0ms, checkpointListenersExecuteTime=6ms,
checkpointLockHoldTime=8ms, reason='timeout']
[20:21:58,044][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:47585
client]-#4176][TcpDiscoverySpi] Finished serving remote node connection
[rmtAddr=/10.1.2.xxx:47585, rmtPort=47585

[20:21:58,045][WARNING][grid-timeout-worker-#23][TcpDiscoverySpi] Socket
write has timed out (consider increasing
'IgniteConfiguration.failureDetectionTimeout' configuration property)
[failureDetectionTimeout=10000, rmtAddr=/10.1.2.xxx:47585, rmtPort=47585,
sockTimeout=5000]

I don't understand this socket timeout line: that remote address is the
client's remote address, so I don't know what it was doing here, and the
timeout reported is the failureDetectionTimeout rather than the
clientFailureDetectionTimeout, which I don't get.

It then seems to connect to the client discovery just fine here:

[20:22:10,170][INFO][tcp-disco-srvr-[:47500]-#3][TcpDiscoverySpi] TCP
discovery accepted incoming connection [rmtAddr=/10.1.2.xxx, rmtPort=56921]
[20:22:10,170][INFO][tcp-disco-srvr-[:47500]-#3][TcpDiscoverySpi] TCP
discovery spawning a new thread for connection [rmtAddr=/10.1.2.xxx,
rmtPort=56921]
[20:22:10,171][INFO][tcp-disco-sock-reader-[]-#4178][TcpDiscoverySpi]
Started serving remote node connection [rmtAddr=/10.1.2.xxx:56921,
rmtPort=56921]
[20:22:10,175][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:56921
client]-#4178][TcpDiscoverySpi] Initialized connection with remote client
node [nodeId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3,
rmtAddr=/10.1.2.xxx:56921]
[20:22:27,870][INFO][tcp-disco-sock-reader-[1b91b2a5 10.1.2.xxx:56921
client]-#4178][TcpDiscoverySpi] Finished serving remote node connection
[rmtAddr=/10.1.2.xxx:56921, rmtPort=56921

The client hits its timeout at 20:22:28, which is the 60-second timeout we
give it starting from 20:21:28, so this finished message comes almost
exactly at the timeout threshold.

Given that socket timeout above, is the second chunk of logs from
20:22:10-20:22:27 a client discovery retry?  

The client exits at 20:22:28 because of its 60-second timeout, and probably
didn't get the above discovery response message in time?

This server node then notices the client didn't respond within 30 seconds
from 20:22:27 to 20:22:57 (and since it timed out at 20:22:28 and exited
that generally seems to fit):

[20:22:57,811][WARNING][tcp-disco-msg-worker-[21ddf49c 10.1.10.xxx:47500
crd]-#2][TcpDiscoverySpi] Failing client node due to not receiving metrics
updates from client node within
'IgniteConfiguration.clientFailureDetectionTimeout' (consider increasing
configuration property) [timeout=30000, node=TcpDiscoveryNode
[id=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3,
consistentId=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3, addrs=ArrayList
[127.0.0.1, 172.17.0.3], sockAddrs=HashSet [/127.0.0.1:0, /172.17.0.3:0],
discPort=0, order=4146, intOrder=2076, lastExchangeTime=1604089287814,
loc=false, ver=2.8.1#20200521-sha1:86422096, isClient=true]]
[20:22:57,812][WARNING][disco-event-worker-#41][GridDiscoveryManager] Node
FAILED: TcpDiscoveryNode [id=1b91b2a5-05ac-4809-8a3d-c1c2efb6a3e3,
consistentId=1b91b2a5-05ac-480

Re: "Node with same ID" error

2020-10-01 Thread tschauenberg
I also found two stack overflow comments suggesting they saw it after
upgrading to 2.8.1:

https://stackoverflow.com/questions/62258394/i-would-like-to-know-the-cause-for-this-error-org-apache-ignite-spi-ignitespiex#comment111425289_62258394

https://stackoverflow.com/questions/62258394/i-would-like-to-know-the-cause-for-this-error-org-apache-ignite-spi-ignitespiex#comment110110653_62258882

In our case our cluster was up and running for months and then we
encountered this problem and can't resolve it.  Nothing changed in the
network topology or server configuration.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


"Node with same ID" error

2020-10-01 Thread tschauenberg
Background: Running an Ignite 2.8.1 cluster. 3 node server configuration with
one persistent client and one or more ad hoc clients.

Problem: We ssh'ed onto one of the nodes and ran visor there to quickly
gather cache stats.  Visor hung indefinitely and one of the 3 nodes had its
ignite process exit.  We kill -9'ed Visor.  We then attempted to start the
failed ignite process.

We tried unsuccessfully and saw the error "Node with the same ID was found
in node IDs history or existing node in topology has the same ID".  We
waited and tried again and then it connected just fine.

To try and verify that the cluster was "healthy", we thought, ok, let's try
stopping that ignite process again and restarting it just to verify things
are back to normal.

This put us in a situation where every single attempt to start resulted in
"Caused by: class org.apache.ignite.spi.IgniteSpiException: Node with the
same ID was found in node IDs history or existing node in topology has the
same ID (fix configuration and restart local node)"

We removed this node from the baseline.  Then we deleted its work directory
and attempted to restart, and saw the same problem.  We then destroyed the
machine entirely and created a new machine with a fresh install of ignite,
and that new machine won't start its ignite process either, failing with the
same error.

We are now in a state where we can't join any new nodes to the cluster at
all; every attempt, even from a brand new machine, reports the same error.

How can we repair our cluster to get rid of this error and get a new node to
join?

Thanks,
Terence.



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Visor configuration

2020-08-19 Thread tschauenberg
To use visor we typically ssh onto a server node and run visor there.  When
doing so we launch visor with the exact same configuration as that server
node is running.

Two questions regarding this:
* Is running visor from a server node problematic?  
* Should we be using a different configuration for visor such as one that
sets IgniteConfiguration.clientMode to true?

Additionally, related to this, we see that running visor often causes one
of N server nodes to be terminated in Ignite 2.7.0 (haven't tried
reproducing in 2.8.1 as we need to upgrade first).  I think this is related
to us not having the failureDetectionTimeout set anywhere in the config.

Two questions regarding this:
* Why do the server nodes stay connected just fine without failures, but as
soon as visor is connected it causes one of them to be kicked for responding
too slowly?  The servers and visor are all using the same config and are all
in the same network.  Visor in this scenario is running on one of the server
nodes.
* When setting the failureDetectionTimeout, does it have to be set to the
same value on all server nodes, all client nodes, and on visor?  Or is
failureDetectionTimeout a setting on the local node only, determining how
long that local node will wait when talking to remote nodes?  For example,
if we are seeing the problem just when starting visor, is it reasonable to
increase the failureDetectionTimeout just in the visor configuration (see
the sketch below)?
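
To make the last question concrete, the change I have in mind is just this
one property (Kotlin sketch; the 120000 ms value is an arbitrary example,
not a recommendation):

failureDetectionTimeout sketch:
---

import org.apache.ignite.configuration.IgniteConfiguration

// Example only: raise the failure detection timeout in the configuration of
// whichever node(s) actually need it - the question is whether this has to
// be identical on all server nodes, clients, and visor.
fun withLongerFailureDetection(): IgniteConfiguration {
    val cfg = IgniteConfiguration()
    cfg.setFailureDetectionTimeout(120000L)
    return cfg
}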



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Re: Bug with client cache_size for metrics() and localMetrics()

2020-07-21 Thread tschauenberg
Apologies.  I modified the tests to wait longer for the 2.8.1 test scenario
on the client printouts and eventually the cluster metrics started
reporting.

Client - Cluster metrics: cache_size=100
Client -   Local metrics: cache_size=0



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/


Bug with client cache_size for metrics() and localMetrics()

2020-07-21 Thread tschauenberg
Attachments: IgniteClusterClient.java, IgniteClusterNode1.java,
IgniteClusterNode2.java

The following bug could be reproduced on both server and client nodes with
Ignite 2.7.0.  In Ignite 2.8.1, the server node counts are correct but the
client node counts are still incorrect.

*Procedure:*

1. Run attached IgniteClusterNode1.java
2. Run attached IgniteClusterNode2.java
3. Run attached IgniteClusterClient.java
4. Wait for the client to put 100 items in the test cache and then observe
the local and cluster-wide cache_size metrics (see the reading sketch below)
5. The cluster-wide metric should be 100 on both servers and the client.
The local metrics on the servers should sum to 100 and the client should
show 0.
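
For reference, the "Cluster metrics" / "Local metrics" lines in the results
below are produced roughly like this (simplified Kotlin; the cache name
"test" is a placeholder for the cache created in the attached code):

Metrics printout sketch:
---

import org.apache.ignite.Ignite

// Cluster-wide vs. node-local view of the cache size.
fun printCacheSizes(ignite: Ignite, label: String) {
    val cache = ignite.cache<Any, Any>("test")
    println("$label - Cluster metrics: cache_size=${cache.metrics().cacheSize}")
    println("$label -   Local metrics: cache_size=${cache.localMetrics().cacheSize}")
}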

*Summary:*
* both server and client counts were wrong in Ignite 2.7.0
* server counts were fixed somewhere before Ignite 2.8.1 but client counts
are still wrong 

*Ignite 2.7.0 Results:*

*Server1:* 

/Actual:/

Server1 - Cluster metrics: cache_size=40
Server1 - Local metrics: cache_size=40

/Expected:/

Server1 - Cluster metrics: cache_size=100
Server1 - Local metrics: cache_size=40

/Working Incorrectly/

*Server2:*

/Actual:/

Server2 - Cluster metrics: cache_size=60
Server2 - Local metrics: cache_size=60

/Expected:/

Server2 - Cluster metrics: cache_size=100
Server2 - Local metrics: cache_size=60

/Working Incorrectly/

*Client:*

/Actual:/

Client - Cluster metrics: cache_size=0
Client - Local metrics: cache_size=0

/Expected:/

Client - Cluster metrics: cache_size=100
Client - Local metrics: cache_size=0

/Working Incorrectly/

*Ignite 2.8.1 Results:*

*Server1:*

/Actual and Expected:/

Server1 - Cluster metrics: cache_size=100
Server1 - Local metrics: cache_size=40

/Working Correctly/

*Server2:*

/Actual and Expected:/

Server2 - Cluster metrics: cache_size=100
Server2 - Local metrics: cache_size=60

/Working Correctly/

*Client:*

/Actual:/

Client - Cluster metrics: cache_size=0
Client - Local metrics: cache_size=0

/Expected:/

Client - Cluster metrics: cache_size=100
Client - Local metrics: cache_size=0

/Working Incorrectly/



--
Sent from: http://apache-ignite-users.70518.x6.nabble.com/