Re: Operation block on Cluster recovery/rebalance.

Denis Magda Fri, 14 Aug 2020 15:23:15 -0700

@Evgenii Zhuravlev <ezhurav...@gridgain.com>, @Ilya Kasnacheev
<ilya.kasnach...@gmail.com>, any thoughts on this?


As a dirty workaround, you can update your cache references on client
reconnect events. Calling ignite.cache(cacheName) throws an exception for
as long as the cluster is not activated yet.
Does this work for you?

-
Denis
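
A minimal sketch of the workaround above, using only the JDK so that it runs
without a cluster. The class name RefreshableRef and the Supplier are
hypothetical: the Supplier stands in for a lookup such as
() -> ignite.cache(cacheName), and refresh() is meant to be called from
whatever reconnect hook the client has (in Ignite, presumably a local listener
for EventType.EVT_CLIENT_NODE_RECONNECTED). While the lookup keeps throwing,
callers see an empty Optional and can skip the operation instead of blocking.

```java
import java.util.Optional;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

/** Holds a reference that is looked up again on every reconnect event. */
public class RefreshableRef<C> {
    // Hypothetical stand-in for a lookup such as () -> ignite.cache(cacheName).
    private final Supplier<C> lookup;
    private final AtomicReference<C> ref = new AtomicReference<>();

    public RefreshableRef(Supplier<C> lookup) {
        this.lookup = lookup;
        refresh();
    }

    /** Call this from the reconnect listener. Returns true if a usable reference was obtained. */
    public boolean refresh() {
        try {
            C fresh = lookup.get();
            ref.set(fresh);
            return fresh != null;
        } catch (RuntimeException e) {
            // The lookup failed (for example, the cluster is still inactive):
            // drop the stale reference so that callers fail fast.
            ref.set(null);
            return false;
        }
    }

    /** Empty while no usable reference is available. */
    public Optional<C> current() {
        return Optional.ofNullable(ref.get());
    }
}
```

A repository class would then call current() before each operation and return
an error response right away when it is empty, which is roughly the "flag to
skip the operation" asked about below.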


On Fri, Aug 14, 2020 at 3:12 PM John Smith <java.dev....@gmail.com> wrote:

> Is there any workaround? I can't have an HTTP server block on all
> requests.
>
> 1- I need to figure out why I lose a server node every few weeks; rebooting
> the nodes then leaves the cluster inactive until they are all back.
>
> 2- Implement some kind of logic on the client side not to block the HTTP
> part...
>
> Can an IgniteCache instance be notified of disconnect events, so that I can
> set a flag in the repository class to skip the operation?
>
>
> On Fri., Aug. 14, 2020, 5:17 p.m. Denis Magda, <dma...@apache.org> wrote:
>
>> My guess is that it's standard behavior for all operations (SQL, key-value,
>> compute, etc.). But I'll let the maintainers of those modules clarify.
>>
>> -
>> Denis
>>
>>
>> On Fri, Aug 14, 2020 at 1:44 PM John Smith <java.dev....@gmail.com>
>> wrote:
>>
>>> Hi Denis, just to understand: is it all operations or just the query?
>>>
>>> On Fri., Aug. 14, 2020, 12:53 p.m. Denis Magda, <dma...@apache.org>
>>> wrote:
>>>
>>>> John,
>>>>
>>>> Ok, we nailed it. That's the current expected behavior. Generally, I
>>>> agree with you that the platform should support an option for operations
>>>> to fail if the cluster is deactivated. Could you propose the change by
>>>> starting a discussion on the dev list? You can refer to this user list
>>>> discussion. Let me know if you need help with this.
>>>>
>>>> -
>>>> Denis
>>>>
>>>>
>>>> On Thu, Aug 13, 2020 at 5:55 PM John Smith <java.dev....@gmail.com>
>>>> wrote:
>>>>
>>>>> No, I reuse the instance. The cache instance is created once at
>>>>> startup of the application and I pass it to my "repository" class:
>>>>>
>>>>> public abstract class AbstractIgniteRepository<K, V>
>>>>>         implements CacheRepository<K, V> {
>>>>>     public final long DEFAULT_OPERATION_TIMEOUT = 2000;
>>>>>
>>>>>     private Vertx vertx;
>>>>>     private IgniteCache<K, V> cache;
>>>>>
>>>>>     AbstractIgniteRepository(Vertx vertx, IgniteCache<K, V> cache) {
>>>>>         this.vertx = vertx;
>>>>>         this.cache = cache;
>>>>>     }
>>>>>
>>>>> ...
>>>>>
>>>>>     Future<List<JsonArray>> query(final String sql, final long timeoutMs,
>>>>>             final Object... args) {
>>>>>         final Promise<List<JsonArray>> promise = Promise.promise();
>>>>>
>>>>>         vertx.setTimer(timeoutMs, l -> {
>>>>>             // This fires if the blocking call doesn't complete in time.
>>>>>             promise.tryFail(new TimeoutException(
>>>>>                 "Cache operation did not complete within: " + timeoutMs + " Ms."));
>>>>>         });
>>>>>
>>>>>         vertx.<List<JsonArray>>executeBlocking(code -> {
>>>>>             SqlFieldsQuery query = new SqlFieldsQuery(sql).setArgs(args);
>>>>>             query.setTimeout((int) timeoutMs, TimeUnit.MILLISECONDS);
>>>>>
>>>>>
>>>>>             // cache.query() BLOCKS HERE.
>>>>>             try (QueryCursor<List<?>> cursor = cache.query(query)) {
>>>>>                 List<JsonArray> rows = new ArrayList<>();
>>>>>                 Iterator<List<?>> iterator = cursor.iterator();
>>>>>
>>>>>                 while(iterator.hasNext()) {
>>>>>                     List<?> currentRow = iterator.next();
>>>>>                     JsonArray row = new JsonArray();
>>>>>
>>>>>                     currentRow.forEach(o -> row.add(o));
>>>>>
>>>>>                     rows.add(row);
>>>>>                 }
>>>>>
>>>>>                 code.complete(rows);
>>>>>             } catch(Exception ex) {
>>>>>                 code.fail(ex);
>>>>>             }
>>>>>         }, result -> {
>>>>>             if(result.succeeded()) {
>>>>>                 promise.tryComplete(result.result());
>>>>>             } else {
>>>>>                 promise.tryFail(result.cause());
>>>>>             }
>>>>>         });
>>>>>
>>>>>         return promise.future();
>>>>>     }
>>>>>
>>>>>     public <T> T cache() {
>>>>>         return (T) cache;
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 13 Aug 2020 at 16:29, Denis Magda <dma...@apache.org> wrote:
>>>>>
>>>>>> I've created a simple test and always get the exception below when I
>>>>>> try to get a reference to an IgniteCache instance while the cluster is
>>>>>> not activated:
>>>>>>
>>>>>> *Exception in thread "main" class org.apache.ignite.IgniteException:
>>>>>> Can not perform the operation because the cluster is inactive. Note, that
>>>>>> the cluster is considered inactive by default if Ignite Persistent Store
>>>>>> is used to let all the nodes join the cluster. To activate the cluster
>>>>>> call Ignite.active(true)*
>>>>>>
>>>>>> Are you trying to get a new IgniteCache reference whenever the client
>>>>>> reconnects successfully to the cluster? My gut feeling is that Ignite
>>>>>> currently verifies the activation status and throws the exception above
>>>>>> whenever you get a reference to an IgniteCache or IgniteCompute. But once
>>>>>> you have those references and try to run operations on them, the
>>>>>> operations get stuck if the cluster is not activated.
>>>>>> -
>>>>>> Denis
>>>>>>
>>>>>>
>>>>>> On Thu, Aug 13, 2020 at 6:37 AM John Smith <java.dev....@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The cache.query() call starts to block when the Ignite server nodes are
>>>>>>> being restarted and there's no baseline topology yet. The server nodes
>>>>>>> do not block. It's the client that blocks.
>>>>>>>
>>>>>>> The dump files are from the server nodes. The screenshot is from the
>>>>>>> client app, taken with the YourKit profiler on the client side; the
>>>>>>> blocked threads are marked in red in YourKit.
>>>>>>>
>>>>>>> The app is simple: it receives an HTTP request, runs a SQL query on the
>>>>>>> Ignite cache and, if that succeeds, does a put back to Ignite.
>>>>>>>
>>>>>>> The Client disconnected exception only happens when all server nodes
>>>>>>> in the cluster are down. The blockage only happens when the cluster is
>>>>>>> trying to establish baseline topology.
>>>>>>>
>>>>>>> On Wed., Aug. 12, 2020, 6:28 p.m. Denis Magda, <dma...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> John,
>>>>>>>>
>>>>>>>> I don't see any signs of an application-caused deadlock in the
>>>>>>>> thread dumps. Please elaborate on the following:
>>>>>>>>
>>>>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>>>>>> ClientDisconnectedException but application still able to complete its
>>>>>>>>> request.
>>>>>>>>
>>>>>>>>
>>>>>>>> What's the IP address of the server node the client app uses to
>>>>>>>> join the cluster? If that's not the address of the 1st node, which has
>>>>>>>> already been restarted, then the client couldn't join the cluster and
>>>>>>>> it's expected that it fails with the ClientDisconnectedException.
>>>>>>>>
>>>>>>>> 8- Start 2nd node, run operation, from here on all operations just
>>>>>>>>> block.
>>>>>>>>
>>>>>>>>
>>>>>>>> Are the operations unblocked and completed successfully when the
>>>>>>>> third node joins the cluster and the cluster gets activated
>>>>>>>> automatically?
>>>>>>>>
>>>>>>>> -
>>>>>>>> Denis
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Aug 12, 2020 at 11:08 AM John Smith <java.dev....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ok Denis here they are...
>>>>>>>>>
>>>>>>>>> There are 3 nodes, and I captured a YourKit screenshot of what it
>>>>>>>>> thinks are deadlocks in the client app.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> https://www.dropbox.com/sh/2cxjkngvx0ubw3b/AADa--HQg-rRsY3RBo2vQeJ9a?dl=0
>>>>>>>>>
>>>>>>>>> On Wed, 12 Aug 2020 at 11:07, John Smith <java.dev....@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Denis. I will ASAP, but I think you were right: it is the
>>>>>>>>>> query that blocks.
>>>>>>>>>>
>>>>>>>>>> My application first runs a select on the cache and then
>>>>>>>>>> does a put to the cache.
>>>>>>>>>>
>>>>>>>>>> On Tue, 11 Aug 2020 at 19:22, Denis Magda <dma...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> John,
>>>>>>>>>>>
>>>>>>>>>>> It sounds like a deadlock caused by the application logic. Is
>>>>>>>>>>> there any chance that the operation you run on step 8 accesses
>>>>>>>>>>> several keys in one order while the other operations work with the
>>>>>>>>>>> same keys but in a different order? Such deadlocks are possible when
>>>>>>>>>>> you use the Ignite Transaction API or simply execute bulk operations
>>>>>>>>>>> such as cache.getAll() or cache.putAll(..).
>>>>>>>>>>>
>>>>>>>>>>> Please take and attach thread dumps from all the cluster nodes
>>>>>>>>>>> for analysis if we need to dig deeper.
>>>>>>>>>>>
>>>>>>>>>>> -
>>>>>>>>>>> Denis
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Aug 10, 2020 at 6:23 PM John Smith <
>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Denis, I think you are right. It's the query that blocks;
>>>>>>>>>>>> the other k/v operations are OK.
>>>>>>>>>>>>
>>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, 10 Aug 2020 at 15:28, John Smith <
>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I tried with 2.8.1, same issue. Operations block
>>>>>>>>>>>>> indefinitely...
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1- Start 3 node cluster
>>>>>>>>>>>>> 2- Start client application client = true with Ignition.start()
>>>>>>>>>>>>> 3- Run some cache operations, everything ok...
>>>>>>>>>>>>> 4- Shut down one node, run operation, still ok
>>>>>>>>>>>>> 5- Shut down 2nd node, run operation, still ok
>>>>>>>>>>>>> 6- Shut down 3rd node, run operation, still ok...
>>>>>>>>>>>>> Operations start failing with ClientDisconnectedException...
>>>>>>>>>>>>> 7- Restart 1st node, run operation, operation fails
>>>>>>>>>>>>> with ClientDisconnectedException but application still able to
>>>>>>>>>>>>> complete its request.
>>>>>>>>>>>>> 8- Start 2nd node, run operation, from here on all operations
>>>>>>>>>>>>> just block.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Basically, the client application is an HTTP server that
>>>>>>>>>>>>> performs a cache operation on each HTTP request.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 19:46, John Smith <
>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> No, everything blocks... Also using 2.7.0 just in case.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The only time I get an exception is when the cluster is
>>>>>>>>>>>>>> completely off; then I get ClientDisconnectedException...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 18:52, Denis Magda <dma...@apache.org>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If I'm not mistaken, key-value operations (cache.get/put)
>>>>>>>>>>>>>>> and compute calls fail with an exception if the cluster is
>>>>>>>>>>>>>>> deactivated. Do those fail on your end?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As for the async and SQL operations, let's see what other
>>>>>>>>>>>>>>> community members say.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> -
>>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Fri, Aug 7, 2020 at 1:06 PM John Smith <
>>>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi any thoughts on this?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:33, John Smith <
>>>>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Here is another example where it blocks.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> SqlFieldsQuery query = new SqlFieldsQuery(
>>>>>>>>>>>>>>>>>         "select * from my_table")
>>>>>>>>>>>>>>>>>         .setArgs(providerId, carrierCode);
>>>>>>>>>>>>>>>>> query.setTimeout(1000, TimeUnit.MILLISECONDS);
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> try (QueryCursor<List<?>> cursor = cache.query(query))
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> cache.query just blocks even with the timeout set.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Is there a way to timeout and at least have the
>>>>>>>>>>>>>>>>> application continue and respond with an appropriate message?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:06, John Smith <
>>>>>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Hi running 2.7.0
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> When I reboot a node and it begins to rejoin the cluster,
>>>>>>>>>>>>>>>>>> or the cluster is not yet activated with a baseline topology,
>>>>>>>>>>>>>>>>>> operations seem to block forever, including operations that
>>>>>>>>>>>>>>>>>> are supposed to return an IgniteFuture, i.e. putAsync,
>>>>>>>>>>>>>>>>>> getAsync, etc. They just block until the cluster resolves its
>>>>>>>>>>>>>>>>>> state.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
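
On the question raised in the thread of how to time out a call that blocks
even with a query timeout set, here is a JDK-only sketch of the same idea as
the Vert.x timer in the repository class quoted above. The names TimeBoxed and
run are hypothetical, and the Supplier stands in for the blocking call (for
example, one that wraps cache.query(...)). It assumes Java 9 or later for
CompletableFuture.orTimeout.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

/** Runs blocking calls on a small dedicated pool and gives up waiting after a timeout. */
public class TimeBoxed {
    // A small, separate pool: calls that never return can exhaust only this pool.
    private static final ExecutorService WORKERS = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r, "time-boxed-worker");
        t.setDaemon(true); // a stuck call must not keep the JVM alive
        return t;
    });

    /**
     * The returned future completes with the call's result, or exceptionally
     * with a TimeoutException if the call has not finished within timeoutMs.
     */
    public static <T> CompletableFuture<T> run(Supplier<T> blockingCall, long timeoutMs) {
        return CompletableFuture.supplyAsync(blockingCall, WORKERS)
                .orTimeout(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

The timeout releases only the caller. The worker thread stays occupied for as
long as the underlying call blocks, so under sustained load the pool can still
fill up. That is one reason to keep it separate from the threads that serve
HTTP requests.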
