Re: Operation block on Cluster recovery/rebalance.

John Smith Fri, 14 Aug 2020 15:12:48 -0700

Is there any workaround? I can't have an HTTP server block on all requests.


1- I need to figure out why I lose a server node every few weeks. Rebooting
the nodes causes the inactive state until they are back....

2- Implement some kind of logic on the client side not to block the HTTP
part...
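
A minimal sketch of what item 2 could look like, using only the JDK (Java 9+)
rather than Vert.x or Ignite APIs. The class name BoundedCacheCalls and the
maxInFlight parameter are hypothetical. It assumes the blocking call cannot be
interrupted, so the timeout only releases the caller; the worker thread stays
stuck until the cluster resolves its state, and the semaphore caps how many
threads can be stuck at once.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: runs blocking cache calls off the HTTP threads,
// fails the caller after a timeout, and rejects new calls once too many
// earlier calls are still blocked.
final class BoundedCacheCalls {
    private final Semaphore permits;
    private final ExecutorService pool = Executors.newCachedThreadPool(r -> {
        Thread t = new Thread(r, "cache-call");
        t.setDaemon(true); // do not keep the JVM alive for stuck calls
        return t;
    });

    BoundedCacheCalls(int maxInFlight) {
        this.permits = new Semaphore(maxInFlight);
    }

    <T> CompletableFuture<T> submit(Supplier<T> op, long timeoutMs) {
        // Fail fast instead of queueing behind calls that may never return.
        if (!permits.tryAcquire()) {
            return CompletableFuture.failedFuture(
                new RejectedExecutionException("Too many cache operations in flight"));
        }
        CompletableFuture<T> result = new CompletableFuture<>();
        try {
            pool.execute(() -> {
                try {
                    T value;
                    try {
                        value = op.get(); // e.g. the cache.query(...) call
                    } finally {
                        permits.release(); // released only when the call returns
                    }
                    result.complete(value);
                } catch (Throwable t) {
                    result.completeExceptionally(t);
                }
            });
        } catch (RejectedExecutionException e) {
            permits.release();
            result.completeExceptionally(e);
        }
        // Completes the future with a TimeoutException if the call is still
        // running; the worker thread itself is not stopped.
        return result.orTimeout(timeoutMs, TimeUnit.MILLISECONDS);
    }
}
```

The HTTP handler would attach its response logic to the returned future and
answer with an error on TimeoutException or RejectedExecutionException.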

Can an IgniteCache instance be notified of disconnect events, so I can tell
the repository class to set a flag and skip the operation?
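
A rough sketch of the flag idea, in plain Java with no Ignite calls. The class
ClusterGate and its method names are hypothetical. How the flag gets flipped
is an assumption: it could be a local listener for Ignite's
EVT_CLIENT_NODE_DISCONNECTED and EVT_CLIENT_NODE_RECONNECTED events, but those
cover disconnects only, and whether the client gets any notification while the
cluster is merely inactive is exactly the open question in this thread.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical gate the repository class could consult before each call.
final class ClusterGate {
    private final AtomicBoolean available = new AtomicBoolean(true);

    // Assumption: called from a disconnect notification or a health check.
    void markUnavailable() {
        available.set(false);
    }

    // Assumption: called from a reconnect notification or a health check.
    void markAvailable() {
        available.set(true);
    }

    boolean isAvailable() {
        return available.get();
    }

    // Runs the cache operation only while the gate is open; otherwise
    // returns the fallback without touching the cache at all.
    <T> T callOrElse(Supplier<T> op, Supplier<T> fallback) {
        return available.get() ? op.get() : fallback.get();
    }
}
```

With this, the repository would wrap each query in callOrElse and the HTTP
handler would return a "temporarily unavailable" response from the fallback.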


On Fri., Aug. 14, 2020, 5:17 p.m. Denis Magda, <dma...@apache.org> wrote:

> My guess is that it's standard behavior for all operations (SQL, key-value,
> compute, etc.). But I'll let the maintainers of those modules clarify.
>
> -
> Denis
>
>
> On Fri, Aug 14, 2020 at 1:44 PM John Smith <java.dev....@gmail.com> wrote:
>
>> Hi Denis, so just to understand: is it all operations or just the query?
>>
>> On Fri., Aug. 14, 2020, 12:53 p.m. Denis Magda, <dma...@apache.org>
>> wrote:
>>
>>> John,
>>>
>>> Ok, we nailed it. That's the current expected behavior. Generally, I
>>> agree with you that the platform should support an option to make
>>> operations fail if the cluster is deactivated. Could you propose the change
>>> by starting a discussion on the dev list? You can refer to this user list
>>> discussion. Let me know if you need help with this.
>>>
>>> -
>>> Denis
>>>
>>>
>>> On Thu, Aug 13, 2020 at 5:55 PM John Smith <java.dev....@gmail.com>
>>> wrote:
>>>
>>>> No, I reuse the instance. The cache instance is created once at startup
>>>> of the application and I pass it to my "repository" class:
>>>>
>>>> public abstract class AbstractIgniteRepository<K, V>
>>>>         implements CacheRepository<K, V> {
>>>>     public static final long DEFAULT_OPERATION_TIMEOUT = 2000;
>>>>
>>>>     private Vertx vertx;
>>>>     private IgniteCache<K, V> cache;
>>>>
>>>>     AbstractIgniteRepository(Vertx vertx, IgniteCache<K, V> cache) {
>>>>         this.vertx = vertx;
>>>>         this.cache = cache;
>>>>     }
>>>>
>>>> ...
>>>>
>>>>     Future<List<JsonArray>> query(final String sql, final long timeoutMs,
>>>>             final Object... args) {
>>>>         final Promise<List<JsonArray>> promise = Promise.promise();
>>>>
>>>>         // This fires if the blocking call below doesn't complete in time.
>>>>         vertx.setTimer(timeoutMs, l -> {
>>>>             promise.tryFail(new TimeoutException(
>>>>                 "Cache operation did not complete within: " + timeoutMs + " ms."));
>>>>         });
>>>>
>>>>         vertx.<List<JsonArray>>executeBlocking(code -> {
>>>>             SqlFieldsQuery query = new SqlFieldsQuery(sql).setArgs(args);
>>>>             query.setTimeout((int) timeoutMs, TimeUnit.MILLISECONDS);
>>>>
>>>>
>>>>             // cache.query(...) is the call that blocks.
>>>>             try (QueryCursor<List<?>> cursor = cache.query(query)) {
>>>>                 List<JsonArray> rows = new ArrayList<>();
>>>>                 Iterator<List<?>> iterator = cursor.iterator();
>>>>
>>>>                 while(iterator.hasNext()) {
>>>>                     List<?> currentRow = iterator.next();
>>>>                     JsonArray row = new JsonArray();
>>>>
>>>>                     currentRow.forEach(o -> row.add(o));
>>>>
>>>>                     rows.add(row);
>>>>                 }
>>>>
>>>>                 code.complete(rows);
>>>>             } catch(Exception ex) {
>>>>                 code.fail(ex);
>>>>             }
>>>>         }, result -> {
>>>>             if(result.succeeded()) {
>>>>                 promise.tryComplete(result.result());
>>>>             } else {
>>>>                 promise.tryFail(result.cause());
>>>>             }
>>>>         });
>>>>
>>>>         return promise.future();
>>>>     }
>>>>
>>>>     public <T> T cache() {
>>>>         return (T) cache;
>>>>     }
>>>> }
>>>>
>>>>
>>>>
>>>> On Thu, 13 Aug 2020 at 16:29, Denis Magda <dma...@apache.org> wrote:
>>>>
>>>>> I've created a simple test and always get the exception below when
>>>>> trying to get a reference to an IgniteCache instance while the cluster is
>>>>> not activated:
>>>>>
>>>>> *Exception in thread "main" class org.apache.ignite.IgniteException:
>>>>> Can not perform the operation because the cluster is inactive. Note, that
>>>>> the cluster is considered inactive by default if Ignite Persistent Store is
>>>>> used to let all the nodes join the cluster. To activate the cluster call
>>>>> Ignite.active(true)*
>>>>>
>>>>> Are you trying to get a new IgniteCache reference whenever the client
>>>>> reconnects successfully to the cluster? My gut feeling is that currently
>>>>> Ignite verifies the activation status and throws the exception above
>>>>> whenever you get a reference to an IgniteCache or IgniteCompute. But once
>>>>> you have those references and try to run some operations, those get stuck
>>>>> if the cluster is not activated.
>>>>> -
>>>>> Denis
>>>>>
>>>>>
>>>>> On Thu, Aug 13, 2020 at 6:37 AM John Smith <java.dev....@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> The cache.query() call starts to block when Ignite server nodes are
>>>>>> being restarted and there's no baseline topology yet. The server nodes do
>>>>>> not block. It's the client that blocks.
>>>>>>
>>>>>> The dump files are from the server nodes. The screenshot is from the
>>>>>> client app, taken with the YourKit profiler; on the client side the
>>>>>> threads are marked red in YourKit.
>>>>>>
>>>>>> The app is simple: on each HTTP request it runs a cache SQL query on
>>>>>> Ignite and, if that succeeds, does a put back to Ignite.
>>>>>>
>>>>>> The client disconnected exception only happens when all server nodes
>>>>>> in the cluster are down. The blockage only happens when the cluster is
>>>>>> trying to establish baseline topology.
>>>>>>
>>>>>> On Wed., Aug. 12, 2020, 6:28 p.m. Denis Magda, <dma...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> John,
>>>>>>>
>>>>>>> I don't see any signs of an application-caused deadlock in the
>>>>>>> thread dumps. Please elaborate on the following:
>>>>>>>
>>>>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>>>>> ClientDisconnectedException but application still able to complete its
>>>>>>>> request.
>>>>>>>
>>>>>>>
>>>>>>> What's the IP address of the server node the client app uses to join
>>>>>>> the cluster? If it's not the address of the 1st node, which has already
>>>>>>> been restarted, then the client can't join the cluster and it's expected
>>>>>>> that it fails with the ClientDisconnectedException.
>>>>>>>
>>>>>>>> 8- Start 2nd node, run operation, from here on all operations just
>>>>>>>> block.
>>>>>>>
>>>>>>>
>>>>>>> Are the operations unblocked and completed successfully when the
>>>>>>> third node joins the cluster and the cluster gets activated
>>>>>>> automatically?
>>>>>>>
>>>>>>> -
>>>>>>> Denis
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 12, 2020 at 11:08 AM John Smith <java.dev....@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok Denis, here they are...
>>>>>>>>
>>>>>>>> 3 nodes, and I captured a YourKit screenshot of what it thinks are
>>>>>>>> deadlocks in the client app.
>>>>>>>>
>>>>>>>>
>>>>>>>> https://www.dropbox.com/sh/2cxjkngvx0ubw3b/AADa--HQg-rRsY3RBo2vQeJ9a?dl=0
>>>>>>>>
>>>>>>>> On Wed, 12 Aug 2020 at 11:07, John Smith <java.dev....@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Denis. I will ASAP, but I think you were right: it is the
>>>>>>>>> query that blocks.
>>>>>>>>>
>>>>>>>>> My application first runs a select on the cache and then
>>>>>>>>> does a put to the cache.
>>>>>>>>>
>>>>>>>>> On Tue, 11 Aug 2020 at 19:22, Denis Magda <dma...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> John,
>>>>>>>>>>
>>>>>>>>>> It sounds like a deadlock caused by the application logic. Is
>>>>>>>>>> there any chance that the operation you run on step 8 accesses
>>>>>>>>>> several keys in one order while the other operations work with the
>>>>>>>>>> same keys but in a different order? Deadlocks are possible when you
>>>>>>>>>> use the Ignite Transaction API or simply execute bulk operations such
>>>>>>>>>> as cache.getAll(..) or cache.putAll(..).
>>>>>>>>>>
>>>>>>>>>> Please take and attach thread dumps from all the cluster nodes
>>>>>>>>>> for analysis if we need to dig deeper.
>>>>>>>>>>
>>>>>>>>>> -
>>>>>>>>>> Denis
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Aug 10, 2020 at 6:23 PM John Smith <
>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Denis, I think you are right. It's the query that blocks;
>>>>>>>>>>> the other k/v operations are ok.
>>>>>>>>>>>
>>>>>>>>>>> Any thoughts on this?
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 10 Aug 2020 at 15:28, John Smith <java.dev....@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I tried with 2.8.1, same issue. Operations block indefinitely...
>>>>>>>>>>>>
>>>>>>>>>>>> 1- Start 3 node cluster
>>>>>>>>>>>> 2- Start client application (client = true) with Ignition.start()
>>>>>>>>>>>> 3- Run some cache operations, everything ok...
>>>>>>>>>>>> 4- Shut down one node, run operation, still ok
>>>>>>>>>>>> 5- Shut down 2nd node, run operation, still ok
>>>>>>>>>>>> 6- Shut down 3rd node, run operation, still ok...
>>>>>>>>>>>> Operations start failing with ClientDisconnectedException...
>>>>>>>>>>>> 7- Restart 1st node, run operation, operation fails with
>>>>>>>>>>>> ClientDisconnectedException but application still able to complete
>>>>>>>>>>>> its request.
>>>>>>>>>>>> 8- Start 2nd node, run operation, from here on all operations
>>>>>>>>>>>> just block.
>>>>>>>>>>>>
>>>>>>>>>>>> Basically the client application is an HTTP server that does a
>>>>>>>>>>>> cache operation on each HTTP request.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, 7 Aug 2020 at 19:46, John Smith <java.dev....@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> No, everything blocks... Also using 2.7.0 just in case.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The only time I get an exception is if the cluster is completely
>>>>>>>>>>>>> off; then I get ClientDisconnectedException...
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, 7 Aug 2020 at 18:52, Denis Magda <dma...@apache.org>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I'm not mistaken, key-value operations (cache.get/put) and
>>>>>>>>>>>>>> compute calls fail with an exception if the cluster is
>>>>>>>>>>>>>> deactivated. Do those fail on your end?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> As for the async and SQL operations, let's see what other
>>>>>>>>>>>>>> community members say.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> -
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Aug 7, 2020 at 1:06 PM John Smith <
>>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, any thoughts on this?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:33, John Smith <
>>>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Here is another example where it blocks.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> SqlFieldsQuery query = new SqlFieldsQuery(
>>>>>>>>>>>>>>>>         "select * from my_table")
>>>>>>>>>>>>>>>>         .setArgs(providerId, carrierCode);
>>>>>>>>>>>>>>>> query.setTimeout(1000, TimeUnit.MILLISECONDS);
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> try (QueryCursor<List<?>> cursor = cache.query(query))
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> cache.query just blocks even with the timeout set.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Is there a way to timeout and at least have the application
>>>>>>>>>>>>>>>> continue and respond with an appropriate message?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Thu, 6 Aug 2020 at 23:06, John Smith <
>>>>>>>>>>>>>>>> java.dev....@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi, I'm running 2.7.0.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> When I reboot a node and it begins to rejoin the cluster,
>>>>>>>>>>>>>>>>> or the cluster is not yet activated with baseline topology,
>>>>>>>>>>>>>>>>> operations seem to block forever, including operations that are
>>>>>>>>>>>>>>>>> supposed to return IgniteFuture, i.e. putAsync, getAsync etc.
>>>>>>>>>>>>>>>>> They just block until the cluster resolves its state.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
