Re: Performance and large numbers of servers

Arthur Naseef Wed, 29 Jun 2022 13:44:36 -0700

I posted a PR + Jira ticket with the update:

https://github.com/apache/ignite/pull/10123



The PR checks are still running/pending.  Any feedback/help is appreciated.

Art
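
For readers following the thread, here is a minimal, self-contained Java sketch of the idea discussed in the quoted messages below: keep a second map keyed by service name so a by-name lookup is a single hash lookup instead of a scan over all registered services. If that scan runs once per service being registered (an assumption here), starting n services costs on the order of n^2 name comparisons in total, which would fit the faster-than-linear startup times reported. The class ServiceRegistrySketch, its Desc type, the mux lock, and the unregister method are all hypothetical names for illustration; this is not the code in IgniteServiceProcessor or in the PR. The single lock stands in for the guarantee described below that updates arrive under one monitor, and the removal path is an assumption about what a complete change would also have to cover.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical registry indexed by both id and name (not Ignite code). */
public class ServiceRegistrySketch {
    /** Hypothetical, simplified descriptor; stands in for a real service descriptor. */
    public static final class Desc {
        public final UUID id;
        public final String name;

        public Desc(UUID id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Single lock so both maps always change together (assumed single update path). */
    private final Object mux = new Object();

    private final Map<UUID, Desc> byId = new ConcurrentHashMap<>();
    private final Map<String, Desc> byName = new ConcurrentHashMap<>();

    /** Adds the descriptor to both indexes under the same lock. */
    public void register(Desc d) {
        synchronized (mux) {
            byId.put(d.id, d);
            byName.put(d.name, d);
        }
    }

    /** Removes the descriptor from both indexes; assumes names are unique. */
    public void unregister(UUID id) {
        synchronized (mux) {
            Desc d = byId.remove(id);

            if (d != null)
                byName.remove(d.name, d); // Remove only if the name still maps to this descriptor.
        }
    }

    /** Constant-time lookup through the name index. */
    public Desc lookupByName(String name) {
        return byName.get(name);
    }

    /** The original approach, kept for comparison: scans every registered descriptor. */
    public Desc lookupByNameLinear(String name) {
        for (Desc d : byId.values()) {
            if (d.name.equals(name))
                return d;
        }

        return null;
    }

    /** Number of registered descriptors. */
    public int size() {
        return byId.size();
    }
}
```

Both lookup methods return the same descriptor for a registered name and null otherwise; only the cost per call differs.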

On Tue, Jun 28, 2022 at 10:53 PM Pavel Tupitsyn <ptupit...@apache.org>
wrote:

> Thank you for tracking this down! An additional map by name is a good idea
> there.
>
> > CONCURRENCY NOTE: these two maps need to update concurrently
> All updates are triggered by discovery events, which are raised under
> "synchronized (discoEvtMux)" in GridDiscoveryManager,
> so it is safe to update two maps together.
>
> >  is desc.name() unique?
> Yes
>
>
>
> On Wed, Jun 29, 2022 at 2:06 AM Arthur Naseef <artnas...@apache.org>
> wrote:
>
>> The following is taking most of the time:
>>
>> @Nullable private ServiceInfo lookupInRegisteredServices(String name) {
>>     for (ServiceInfo desc : registeredServices.values()) {
>>         if (desc.name().equals(name))
>>             return desc;
>>     }
>>
>>     return null;
>> }
>>
>> After changing that to use a Map lookup:
>>
>>    - 50,000 service startup in *8s* (down from around 70s)
>>    - 100,000 service startup in *14s* (right around 2x of the 50K timing)
>>
>>
>> Here's the change I tested (note it's shortened) - it's not 100%, but
>> fine for my test case, I believe:
>>
>> private final ConcurrentMap<String, ServiceInfo> registeredServicesByName
>> = new ConcurrentHashMap<>();
>>
>>
>> @Nullable private ServiceInfo lookupInRegisteredServices(String name) {
>>     return registeredServicesByName.get(name);
>> }
>>
>> private void registerService(ServiceInfo desc) {
>>     desc.context(ctx);
>>
>>     // (CONCURRENCY NOTE: these two maps need to update concurrently)
>>     registeredServices.put(desc.serviceId(), desc);
>>     registeredServicesByName.put(desc.name(), desc);
>> }
>>
>>
>> That's in IgniteServiceProcessor.java.
>>
>> Any thoughts?  I'll gladly clean this up and make a PR - would appreciate
>> feedback to help address possible questions with this change (e.g. is
>> desc.name() unique?).
>>
>> Art
>>
>>
>> On Tue, Jun 28, 2022 at 12:27 PM Arthur Naseef <artnas...@apache.org>
>> wrote:
>>
>>> Yes.  The "services" in our case will be schedules that periodically
>>> perform fast operations.
>>>
>>> For example a service could be, "ping this device every <x> seconds".
>>>
>>> Art
>>>
>>> On Tue, Jun 28, 2022 at 12:20 PM Pavel Tupitsyn <ptupit...@apache.org>
>>> wrote:
>>>
>>>> > we do not plan to make cross-cluster calls into the services
>>>>
>>>> If you are making local calls, I think there is no point in using
>>>> Ignite services.
>>>> Can you describe the use case - what are you trying to achieve?
>>>>
>>>> On Tue, Jun 28, 2022 at 8:55 PM Arthur Naseef <artnas...@apache.org>
>>>> wrote:
>>>>
>>>>> Hello - I'm getting started with Ignite and looking seriously at using
>>>>> it for a specific use-case.
>>>>>
>>>>> Working on a Proof-Of-Concept (POC), I am finding a question related
>>>>> to performance, and wondering if the solution, using Ignite Services, is a
>>>>> good fit for the use-case.
>>>>>
>>>>> In my testing, I am getting the following timings:
>>>>>
>>>>>    - Startup of 20,000 ignite services takes 30 seconds
>>>>>    - Startup of 50,000 ignite services takes 250 seconds
>>>>>    - The 2.5x increase from 20,000 to 50,000 yielded > 8x cost in
>>>>>    startup time (appears to be faster-than-linear growth)
>>>>>
>>>>> Watching the JVM during this time, I see the following:
>>>>>
>>>>>    - Heap usage is not significant (do not see signs of GC)
>>>>>    - CPU usage is only slightly increased - on the order of 20% total
>>>>>    (system has 12 cores/24 threads)
>>>>>    - Network utilization is reasonable
>>>>>    - Futex system call (measured with "strace -r") appears to be
>>>>>    taking the most time by far.
>>>>>
>>>>> The use-case involves the following:
>>>>>
>>>>>    - Startup of up-to hundreds-of-thousands of services at cluster
>>>>>    spin-up
>>>>>    - Frequent, small adjustments to the services running over time
>>>>>    - Need to rebalance when a new node joins the cluster, or an old
>>>>>    one leaves the cluster
>>>>>    - Once the services are deployed, we do not plan to make
>>>>>    cross-cluster calls into the services (i.e. we do *not* plan to
>>>>>    use ignite's services().serviceProxy() on these)
>>>>>    - Jobs don't look like a fit because these (1) are "long-running"
>>>>>    (actually periodically scheduled tasks) and (2) they need to
>>>>>    redistribute even after they start running
>>>>>
>>>>> This is starting to get long.  I have more details to share.  Here is
>>>>> the repo with the code being used to test, and a link to a wiki page with
>>>>> some of the details:
>>>>>
>>>>> https://github.com/opennms-forge/distributed-scheduling-poc/
>>>>>
>>>>>
>>>>> https://github.com/opennms-forge/distributed-scheduling-poc/wiki/Ignite-Startup-Performance
>>>>>
>>>>>
>>>>> Questions I have in mind:
>>>>>
>>>>>    - Are services a good fit here?  We expect to reach upwards of
>>>>>    500,000 services in a cluster with multiple nodes.
>>>>>    - Any thoughts on tracking down the bottleneck and alleviating
>>>>>    it?  (I have started taking timing measurements in the Ignite code)
>>>>>
>>>>> Stopping here - please ask questions and I'll gladly fill in details.
>>>>> Any tips are welcome, including ideas for tracking down just where the
>>>>> bottleneck exists.
>>>>>
>>>>> Art
>>>>>
>>>>>
