[ 
https://issues.apache.org/jira/browse/IGNITE-16406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mirza Aliev updated IGNITE-16406:
---------------------------------
    Description: 
For some reason, a select operation can fail to return the expected number of 
rows. We noticed that this happens while the Raft leader is changing. To make 
the problem easier to reproduce, we can slow down message handling slightly, 
for example by adding this code to {{MessageServiceImpl#onMessage(java.lang.String, 
org.apache.ignite.network.NetworkMessage)}}:

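{code:java}
        // Debug-only (needs java.util.concurrent.ThreadLocalRandom):
        // nextInt(3) yields 0, 1 or 2, so about two of every three messages are delayed.
        if (ThreadLocalRandom.current().nextInt(3) % 2 == 0) {
            try {
                Thread.sleep(300);
            } catch (InterruptedException ex) {
                // Restore the interrupt flag instead of swallowing the interruption.
                Thread.currentThread().interrupt();
            }
        }
{code}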


Possible direction for investigation: 
we could check that the cursor.next command is not lost as a Raft response 
while the leader is changing.

UPD: 
We decided to add a consistency check between the received scan command and 
the handled scan command in the partition listener, so a user now gets a state 
machine error and can retry the command. However, we found another 
inconsistency: RocksDB could return hasNext == false after an unexpected 
step-down of the leader (https://issues.apache.org/jira/browse/IGNITE-16478).
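
The check-and-retry idea above can be illustrated with a small, self-contained 
sketch. It is not the Ignite implementation: every name in it 
(ScanConsistencySketch, ToyPartitionListener, StateMachineException, 
retryOnStateMachineError) is hypothetical, and numbering the batches per cursor 
is only one assumed way to compare the received command with the handled one.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names come from the Ignite code base. */
final class ScanConsistencySketch {
    /** Stands in for the "state machine error" that the user would receive. */
    static final class StateMachineException extends RuntimeException {
        StateMachineException(String message) {
            super(message);
        }
    }

    /** Toy listener that remembers, per cursor, which batch number it expects next. */
    static final class ToyPartitionListener {
        private final List<String> rows;
        private final Map<String, Integer> expectedBatch = new HashMap<>();

        ToyPartitionListener(List<String> rows) {
            this.rows = rows;
        }

        /** Returns the row for batchNo, or null once the scan is exhausted. */
        String next(String cursorId, int batchNo) {
            int expected = expectedBatch.getOrDefault(cursorId, 0);
            if (batchNo != expected) {
                // The received command does not match what was handled so far:
                // fail loudly instead of silently returning an incomplete result.
                throw new StateMachineException(
                        "cursor " + cursorId + ": expected batch " + expected + ", got " + batchNo);
            }
            expectedBatch.put(cursorId, expected + 1);
            return batchNo < rows.size() ? rows.get(batchNo) : null;
        }
    }

    /** Re-runs the whole scan when the listener reports an inconsistency. */
    static <T> T retryOnStateMachineError(int maxAttempts, Supplier<T> scan) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        StateMachineException last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return scan.get();
            } catch (StateMachineException e) {
                last = e; // remember the failure and try again
            }
        }
        throw last;
    }
}
```

In this sketch a caller wraps the complete scan in retryOnStateMachineError, so 
a skipped or lost batch surfaces as an error and the scan restarts from the 
first batch rather than returning fewer rows.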

We then decided to change the replica factor to 1 in {{ItMixedQueriesTest}}, 
so that there is only one node in a partition Raft group, but we could not 
enable {{ItMixedQueriesTest}} because of a new error: 
https://issues.apache.org/jira/browse/IGNITE-16502




> SQL select operation could return incomplete data
> -------------------------------------------------
>
>                 Key: IGNITE-16406
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16406
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Assignee: Mirza Aliev
>            Priority: Blocker
>              Labels: ignite-3



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
