Hi,

Following up on this topic: I've been analyzing the issue further and came to the following conclusion. This scenario would only happen if:

1. PdxTypes persistence is disabled, and the cluster is restarted after the PdxInstance is created but before the entry is put into the region.
2. PdxTypes persistence is enabled, and a backup is restored after the creation of a PdxInstance whose PdxType is missing from the backup, but before the entry is put into the region.

The first case could be solved by just enabling PdxTypes persistence. So I am wondering whether the documentation recommends anywhere to enable PdxTypes persistence whenever PDX serialization is used?

For the second case, I came up with two solutions:

* Recommend restarting the clients or stopping the traffic while a cluster restore is ongoing. Is there any mention of this in the documentation?
* Also, if the PdxType is verified just after the put happens, rather than whenever the PdxInstance is created, this situation could be avoided if (connect-timeout + read-timeout) * max-attempts < restore-time, given that the put message will fail because the cluster won't be up.

Another solution I thought of is to modify the protocol to include all PdxTypes for a PdxInstance PUT, so that the server side can verify whether those PdxTypes exist in the cluster. This solution, even though it would solve the issue, requires a lot of effort and would add some computing overhead.

So I wanted to ask for your thoughts on this. Also, are you aware of any backup and restore (B&R) documentation mentioning this issue?

Thanks,
Mario.

________________________________
From: Mario Salazar de Torres <mario.salazar.de.tor...@est.tech>
Sent: Wednesday, May 5, 2021 10:44 AM
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: Re: Region data corruption due to missing PdxTypes

Hi,

I forgot to mention that I enabled the ON_DISCONNECT_CLEAR_PDXTYPEIDS property. Also, I tried a different scenario that does not exactly involve local PdxType retention:

1. Start a cluster with 1 locator and 3 servers, with persistence disabled for PdxTypes.
2. Set up a region called "test-region" with persistence disabled. It doesn't matter whether it is replicated or partitioned.
3. In the client, instantiate the client region with the PROXY region shortcut and establish the connection to the cluster.
4. In the client, create a PdxInstance.
5. At this point, the cluster is restarted, meaning that all the data is lost, including PdxTypes.
6. In the client, the PdxInstance created in step 4 is put into "test-region" with key "test".
7. In the client, the following query is executed: "SELECT * FROM /test-region WHERE value = -1".

The outcome is the same: the query fails with the message "Unknown pdx type=<PdxType ID>", and it won't work until the corrupted entry is removed.

I don't know if you've seen this kind of scenario before. I am just wondering whether this is something that needs to be fixed.

Thanks,
Mario.

________________________________
From: Anthony Baker <bak...@vmware.com>
Sent: Wednesday, May 5, 2021 1:06 AM
To: dev@geode.apache.org <dev@geode.apache.org>
Subject: Re: Region data corruption due to missing PdxTypes

Retaining local PDX types in the client after a disconnect will cause problems, as you observed. Take a look at the “ON_DISCONNECT_CLEAR_PDXTYPEIDS” property to improve this.

Anthony

> On May 4, 2021, at 4:36 AM, Mario Salazar de Torres <mario.salazar.de.tor...@est.tech> wrote:
>
> Hi everyone,
>
> While debugging some coredumps in the native client related to PdxTypeRegistry cleanup, I tried to reproduce the scenario with the Java client API to see how it was handled.
> I've noticed that this scenario in the Java client might lead to Geode storing a corrupted entry, meaning that queries won't work on regions containing corrupted entries.
> By corrupted entries, I mean entries that use a missing PdxType. The scenario involves a cluster restart and is described below:
>
> 1. Start a cluster with 1 locator and 3 servers, with persistence disabled for PdxTypes.
> 2. Set up a region called "test-region" with persistence disabled. It doesn't matter whether it is replicated or partitioned.
> 3. In the client, instantiate the client region with the PROXY region shortcut and establish the connection to the cluster.
> 4. In the client, create a PdxInstance and put it into "test-region" with key "test".
> 5. In the client, get the entry whose key is "test", which turns out to be the PdxInstance inserted in step 4.
> 6. At this point, the cluster is restarted, meaning that all the data is lost, including PdxTypes.
> 7. In the client, the PdxInstance obtained in step 5 is put into "test-region" with key "test2".
> 8. In the client, the following query is executed: "SELECT * FROM /test-region WHERE value = -1".
>
> This query fails with the message "Unknown pdx type=<PdxType ID>", and it won't work until the corrupted entry is removed.
>
> The above scenario could be solved by enabling persistence for PdxTypes. However, if you have an unrecoverable issue in your cluster and need to spin up a backup, it could happen that the PdxType of the PdxInstance obtained in step 5 is not present in the backup, leading to the entry being inserted while, yet again, the PdxType is missing.
>
> It's worth mentioning that in the native client this scenario currently results in a coredump but no data corruption, given that after losing the connection to the cluster the PdxTypeRegistry is cleaned up and PdxTypes are obtained by their ID rather than by directly using the object.
>
> My questions are:
>
> * Have you seen this issue before?
> * Is there a way to verify that PdxTypes are present in the cluster before writing an entry that holds some PdxInstances?
>
> Thanks,
> Mario.
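As a worked example of the condition raised at the top of this thread, (connect-timeout + read-timeout) * max-attempts < restore-time, here is a minimal Java sketch. The class and method names (RetryBudget, worstCasePutTime, putFailsBeforeRestore) and the timeout values are made up for illustration; they are not Geode APIs or Geode defaults. The sketch also assumes every attempt consumes its full connect and read timeout, which is a worst-case simplification.

```java
import java.time.Duration;

// Hypothetical helper, not part of Geode: checks whether a client's
// worst-case retry window for a single put is shorter than the restore time.
final class RetryBudget {
    private RetryBudget() {}

    // Worst-case time spent on one put: (connect-timeout + read-timeout) * max-attempts.
    static Duration worstCasePutTime(Duration connectTimeout, Duration readTimeout, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        return connectTimeout.plus(readTimeout).multipliedBy(maxAttempts);
    }

    // True when the put gives up before the restored cluster is reachable,
    // so the stale PdxInstance is never written.
    static boolean putFailsBeforeRestore(Duration connectTimeout, Duration readTimeout,
                                         int maxAttempts, Duration restoreTime) {
        return worstCasePutTime(connectTimeout, readTimeout, maxAttempts).compareTo(restoreTime) < 0;
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Duration connect = Duration.ofSeconds(5);
        Duration read = Duration.ofSeconds(10);
        int attempts = 3;

        System.out.println("Worst-case put time (s): "
                + worstCasePutTime(connect, read, attempts).getSeconds());
        System.out.println("Fails before a 2-minute restore: "
                + putFailsBeforeRestore(connect, read, attempts, Duration.ofMinutes(2)));
        System.out.println("Fails before a 30-second restore: "
                + putFailsBeforeRestore(connect, read, attempts, Duration.ofSeconds(30)));
    }
}
```

With these sample values the retry window is 45 seconds, so the put would give up during a 2-minute restore but could still reach the cluster after a 30-second one, which is the case where the missing PdxType can slip through.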