Hi Simon,

Thanks for your explanation. It will help me manage expectations with the team 
that developed the flow. We were hoping to do exactly as you suggest, drop in a 
redundant cache without the time and resource investment of setting up an 
external cluster like Redis or Hazelcast. And in fact, it runs fine on most 
days, but as currently set up it doesn't play nice when the load on the cluster 
gets too high or nodes disconnect.

If I get the time to run some tests I'll share the results, but for now I'll 
advise the devs to accept a longer run and schedule the DetectDuplicate less 
often or to revert to using the DistributedMapCacheServer on a single node 
again. If neither is acceptable they can request an external cache service 
cluster.
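
For anyone weighing the same trade-off: the check DetectDuplicate performs against the cache boils down to an atomic put-if-absent keyed on the message id, with a time-to-live. A minimal, stdlib-only sketch of that idea (the class name, TTL handling, and the ConcurrentHashMap backend are my simplifications for illustration, not NiFi's actual implementation):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of cache-based duplicate detection: atomically record the message id;
// if it was already present and not expired, the message is a duplicate.
public class DedupSketch {
    private final Map<String, Long> cache = new ConcurrentHashMap<>();
    private final long ttlMillis;

    public DedupSketch(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    /** Returns true if messageId was already seen within the TTL window. */
    public boolean isDuplicate(String messageId, long nowMillis) {
        Long previous = cache.putIfAbsent(messageId, nowMillis);
        if (previous == null) {
            return false;                    // first sighting: cached, not a duplicate
        }
        if (nowMillis - previous > ttlMillis) {
            cache.put(messageId, nowMillis); // entry expired: refresh, treat as new
            return false;
        }
        return true;                         // seen recently: duplicate
    }

    public static void main(String[] args) {
        DedupSketch dedup = new DedupSketch(60_000);
        System.out.println(dedup.isDuplicate("msg-1", 0));       // false
        System.out.println(dedup.isDuplicate("msg-1", 1_000));   // true
        System.out.println(dedup.isDuplicate("msg-1", 120_000)); // false (expired)
    }
}
```

With 300-500k ids arriving in one burst, every one of these put-if-absent calls becomes a cache write hitting the backend at once, which is why the scheduling of DetectDuplicate matters so much here.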

Thank you very much,

Isha

-----Original Message-----
From: Simon Bence <simonbence....@gmail.com> 
Sent: Wednesday, 22 February 2023 10:47
To: users@nifi.apache.org
Subject: Re: Embedded Hazelcast Cachemanager

Hi Isha,

Without a deeper understanding of the situation I am not sure whether the load 
comes entirely from this part of the batch processing, but for the scope of 
this discussion I will assume it does, and also that this stands in drastic 
contrast to the same measurements taken with DistributedMapCache as the cache.

The EmbeddedHazelcastCacheManager was primarily added as an out-of-the-box 
solution for simpler scenarios, something that can be "grabbed to the canvas" 
without much fuss. Because of this, it has very limited customisation 
capabilities. As your scenario looks to utilise Hazelcast heavily, it might not 
be the ideal tool. It is also important to mention that with the embedded 
approach the Hazelcast instances run on the same servers as NiFi, so they add 
to the load already produced by other parts of the flow.

Using ExternalHazelcastCacheManager can provide much more flexibility: because 
it works with standalone Hazelcast instances, this approach opens up the whole 
range of Hazelcast's performance optimisation capabilities. You can use either 
a single instance reached by all the NiFi nodes (which involves no 
synchronisation between Hazelcast nodes but might become a bottleneck at some 
point) or build up a separate cluster. Of course, the results depend highly on 
the network topology and other factors specific to your use case.
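
For illustration, a standalone two-member setup of that kind can be described 
with a plain Hazelcast YAML configuration along these lines (cluster name, 
hostnames, and port are placeholders, not values from any real environment; 
this assumes the Hazelcast 4.x/5.x declarative config format):

```yaml
# hazelcast.yaml for each standalone cache member
hazelcast:
  cluster-name: nifi-cache
  network:
    port:
      port: 5701
    join:
      multicast:
        enabled: false
      tcp-ip:
        enabled: true
        member-list:
          - cache-node-1.example.nl:5701
          - cache-node-2.example.nl:5701
```

The ExternalHazelcastCacheManager in NiFi would then be pointed at these 
addresses, keeping the cache load off the NiFi nodes themselves.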

Also, I do not know the details of your flows, or whether you prefer processing 
time over throughput, but distributing the batch over time, resulting in 
smaller peaks, is another possible optimisation opportunity.

Best regards,
Bence


> On 2023. Feb 21., at 21:45, Isha Lamboo <isha.lam...@virtualsciences.nl> 
> wrote:
> 
> Hi Simon,
>  
> The Hazelcast cache is being used by a DetectDuplicate processor to cache and 
> eliminate message ids. These arrive in large daily batches with 300-500k 
> messages, most (90+%) of which are actually duplicates. This was previously 
> done with a DistributedMapCacheServer, but that involved using only one of 
> the nodes (hardcoded in the MapCacheClient controller), giving us a single 
> point of failure for the flow. We had hoped to use Hazelcast to have a 
> redundant cacheserver, but I’m starting to think that this scenario causes 
> too many concurrent updates of the cache, on top of the already heavy load 
> from other processing on the batch.
>  
> What was new to me is the CPU load on the cluster in question going through 
> the roof, on all 3 nodes. I have no idea how a 16 vCPU server gets to a load 
> of 100+.
>  
> The start roughly coincides with the arrival of the daily batch, though there 
> may have been other batch processes going on since it’s a Sunday. However, 
> the queues were pretty much empty again in an hour and yet the craziness kept 
> going until I finally decided to restart all nodes.
> <image001.png>
>  
> The hazelcast troubles might well be a side-effect of the NiFi servers being 
> overloaded. There could have been issues at the Azure VM level etc. But 
> activating the Hazelcast controller is the only change I *know* about. And it 
> doesn’t seem farfetched that it got into a loop trying to migrate/copy 
> partitions “lost” on other nodes.
>  
> I’ve attached a file with selected hazelcast warnings and errors from the 
> nifi-app.log files, trying to include as many unique ones as possible.
>  
> The errors that kept repeating were these (always together):
>  
> 2023-02-19 08:58:39,899Z (UTC+0) ERROR 
> [hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47] 
> c.h.i.c.i.operations.LockClusterStateOp [su20cnifi103-ap.REDACTED.nl]:5701 
> [nifi] [4.2.5] Still have pending migration tasks, cannot lock cluster state! 
> New state: ClusterStateChange{type=class com.hazelcast.cluster.ClusterState, 
> newState=FROZEN}, current state: ACTIVE
> 2023-02-19 08:58:39,900Z (UTC+0) WARN 
> [hz.68e948cb-6e3f-445e-b1c8-70311cae9b84.cached.thread-47] 
> c.h.internal.cluster.impl.TcpIpJoiner [su20cnifi103-ap.REDACTED.nl]:5701 
> [nifi] [4.2.5] While changing cluster state to FROZEN! 
> java.lang.IllegalStateException: Still have pending migration tasks, cannot 
> lock cluster state! New state: ClusterStateChange{type=class 
> com.hazelcast.cluster.ClusterState, newState=FROZEN}, current state: ACTIVE
>  
> Thanks,
>  
> Isha
>  
> From: Simon Bence <simonbence....@gmail.com> 
> Sent: Tuesday, 21 February 2023 08:52
> To: users@nifi.apache.org
> Subject: Re: Embedded Hazelcast Cachemanager
>  
> Hi Isha,
>  
> Could you please share the error messages? They might shed light on something 
> that affects the performance.
>  
> On the other hand, I am not aware of exhaustive performance tests for the 
> Hazelcast cache. In general it should not be the bottleneck, but if you could 
> give some details about the error and possibly the intended way of usage, it 
> could help to find a more specific answer.
>  
> Best regards,
> Bence Simon 
> 
> 
> On 2023. Feb 20., at 15:19, Isha Lamboo <isha.lam...@virtualsciences.nl> 
> wrote:
>  
> Hi all,
>  
> This morning I had to fix up a cluster of NiFi 1.18.0 servers where the 
> primary was constantly crashing and moving to the next server.
>  
> One of the recent changes was activating an Embedded Hazelcast Cache, and I 
> did see errors reported about promotions going wrong. I can’t tell if this is 
> cause or effect, so I’m trying to get a feeling for the performance demands of 
> Hazelcast, but there is nothing to configure on the controller service, only a 
> time to live for cache items. The diagnostics dump also didn’t give me 
> anything on this controller service.
>  
> Does anyone have experience with tuning/diagnosing the Hazelcast components 
> within NiFi?
>  
> Kind regards,
> 
> Isha Lamboo
> Data Engineer
>  
> <nifi_hazelcast_log.txt>
