Anil,

We stop, deploy updates to, and restart each node (one locator, one server per 
node) serially, one at a time, and each node has its own local disk stores.
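
Roughly, per node (host names and script names here are illustrative; the 
actual gfsh stop/start commands are in my original mail below):

for NODE in node1 node2 node3; do
  ssh $NODE ./stop-geode.sh     # gfsh stop server, then stop locator
  ssh $NODE ./deploy-geode.sh   # unpack the new Geode version
  ssh $NODE ./start-geode.sh    # gfsh start locator, then start server
done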

-Roger

From: Anilkumar Gingade <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, March 10, 2017 at 3:42 PM
To: "[email protected]" <[email protected]>
Cc: Hitesh Khamesra <[email protected]>
Subject: Re: Unknown Pdx Type use case found, bug or expected?

Roger,

If you are bringing down the locators and servers one after the other, and 
making sure each restarted node is part of the cluster before bringing down 
the next, you should not see these issues.
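
For example, you can confirm each restarted node has rejoined before bringing 
down the next one:

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "list members"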

What is your rolling upgrade procedure? Do you use a common disk store (from 
one server) to restart the other nodes?

-Anil.

On Thu, Mar 9, 2017 at 2:56 PM, Roger Vandusen 
<[email protected]<mailto:[email protected]>> wrote:
Hi Hitesh, thanks for the reply.

I’ll take a look at your links.

Yes, we did try to revoke the disk stores manually with gfsh, but this isn’t 
manageable going into production.
I can’t recall the details of the revoke outcome, but it did not solve our 
problem. I think the disk store we revoked was the pdx disk store, which would 
still potentially lead to ‘unknown pdx type’, right?

Our main concern, in our scenario, was the corruption (unknown pdx types that 
were never registered or persisted) of server-side data from the client puts.

-Roger

From: Hitesh Khamesra <[email protected]>
Reply-To: "[email protected]" <[email protected]>, Hitesh Khamesra 
<[email protected]>
Date: Thursday, March 9, 2017 at 3:05 PM
To: "[email protected]" <[email protected]>, Geode 
<[email protected]>
Subject: Re: Unknown Pdx Type use case found, bug or expected?

Hi Roger:

Sorry to hear about this. There is a system property on the client side to 
clear the pdx registry when the client disconnects from the server. You can 
find details here: 
https://discuss.pivotal.io/hc/en-us/articles/221351508-Getting-Stale-PDXType-error-on-client-after-clean-start-up-of-servers
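
If I remember the article correctly, the property is 
gemfire.ON_DISCONNECT_CLEAR_PDXTYPEIDS, so enabling it on a client JVM would 
look something like this (YourClientApp is a placeholder):

java -Dgemfire.ON_DISCONNECT_CLEAR_PDXTYPEIDS=true -cp $CLASSPATH YourClientApp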

I think we should clear the pdx registry when the client disconnects. I will 
file a ticket to track this issue.

For the disk issue, here are some guidelines: 
https://discuss.pivotal.io/hc/en-us/community/posts/208792347-Region-regionA-has-potentially-stale-data-It-is-waiting-for-another-member-to-recover-the-latest-data-

Did you try to revoke the disk store?
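
For reference, a sketch of the revoke (the ID is a placeholder; take the real 
one from the show missing-disk-stores output):

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "revoke missing-disk-store --id=<missing-disk-store-id>"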

Thanks.
hitesh

________________________________
From: Roger Vandusen <[email protected]>
To: "[email protected]" <[email protected]>
Sent: Thursday, March 9, 2017 12:55 PM
Subject: Unknown Pdx Type use case found, bug or expected?


Hey Geode,

We have a 3-node server cluster running with pdx read-serialized and disk-store 
persistence for all regions, with a replication factor of 2.
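
For illustration, the relevant parts of our geode-cache.xml look roughly like 
this (region and disk-store names are placeholders, and the redundant-copies 
value is illustrative):

<cache xmlns="http://geode.apache.org/schema/cache"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://geode.apache.org/schema/cache
                           http://geode.apache.org/schema/cache/cache-1.0.xsd"
       version="1.0">
  <disk-store name="dataDiskStore"/>
  <disk-store name="pdxDiskStore"/>
  <!-- pdx type metadata is persisted to its own disk store -->
  <pdx read-serialized="true" persistent="true" disk-store-name="pdxDiskStore"/>
  <!-- a persistent partitioned region on the data disk store -->
  <region name="example">
    <region-attributes data-policy="persistent-partition" disk-store-name="dataDiskStore">
      <partition-attributes redundant-copies="2"/>
    </region-attributes>
  </region>
</cache>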

We do not use cluster configuration; we use these property overrides:

#configuration settings used
enable-cluster-configuration=false
use-cluster-configuration=false
cache-xml-file=geode-cache.xml

#property default overrides
distributed-system-id=1
log-level=config
enforce-unique-host=true
locator-wait-time=60
conserve-sockets=false
log-file-size-limit=64
mcast-port=0

We use these stop/start scripts:

STOP:

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "stop server --name=$SERVER_NAME"

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "stop locator --name=$LOCATOR_NAME"

START:

gfsh start locator \
  --properties-file=$CONF_DIR/geode.properties \
  --name=$LOCATOR_NAME \
  --port=$LOCATOR_PORT \
  --log-level=config \
  --include-system-classpath=true \
  --classpath=$CLASSPATH \
  --enable-cluster-configuration=false \
  --J=-Dlog4j.configurationFile=$CONF_DIR/log4j2.xml \
  --J=-Dgemfire.jmx-manager=true \
  --J=-Dgemfire.jmx-manager-start=true \
  --J=-Xms512m \
  --J=-Xmx512m

gfsh start server \
  --properties-file=$CONF_DIR/geode.properties \
  --cache-xml-file=$CONF_DIR/geode-cache.xml \
  --name=$SERVER_NAME \
  --server-port=$SERVER_PORT \
  --include-system-classpath=true \
  --classpath=$CLASSPATH \
  --start-rest-api=true \
  --use-cluster-configuration=false \
  --J=-Dlog4j.configurationFile=$CONF_DIR/log4j2.xml \
  --J=-Dgemfire.disk.recoverValues=false \
  --J=-Dgemfire.jmx-manager=false \
  --J=-Dgemfire.jmx-manager-start=false \
  --J=-Xms6g \
  --J=-Xmx6g


There were active proxy clients (1.0.0-incubating/GFE 9.0) still connected 
while we proceeded to upgrade the Geode version from 1.0.0-incubating to 1.1.0.

We did a scripted rolling Geode version upgrade by serially stopping, 
redeploying, and restarting each server node.
We hit the issue below, which we’ve seen before and still find difficult to 
solve:
‘Region /xxxx has potentially stale data. It is waiting for another member to 
recover the latest data.’
The first server (node1) hung on restart, blocking our rolling serial 
redeployment.
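
For reference, the disk stores a waiting member considers missing can be 
listed with:

gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
     -e "show missing-disk-stores"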

After failing (again) to resolve this serial rolling update problem, we 
decided to delete all the data (currently just cached lookup tables and dev 
WIP/POC data), redeploy the new Geode version, and restart from scratch: we 
deleted all the disk stores (including the pdx disk store) and restarted the 
cluster.
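
Concretely, with the whole cluster stopped, that was something like this on 
each node (the data directory path is illustrative):

for NODE in node1 node2 node3; do
  ssh $NODE rm -rf /data/geode/diskstores/*
done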

REMINDER: the clients were all still connected and had not been restarted!!! 
(See the link below; we are now aware of this CLIENT-SIDE error state.)
These clients then put data into the server cluster; the puts succeeded, and 
the server regions show they have the data.

BUT a gfsh query of this server region data now gives ‘Unknown pdx type’, and 
restarting the clients fails on connecting to these regions with the same 
error: ‘Unknown pdx type’.

We are seeking GEODE-USER feedback regarding the following:

1) We need to find a working enterprise deployment solution to the rolling 
restart problem, where stale-data alerts block cluster config/version updates.
2) We don’t believe the problem we saw was related to the version upgrade 
itself.
3) We find it very concerning that connected clients can CORRUPT SERVER-SIDE 
region data, and that their puts do not update the pdx registry and disk store.
Wouldn’t a FAIL of the client-side proxy region.put make more sense?
Why didn’t the pdx types cached on the client get registered and written back 
to the servers’ disk stores?
The client PUTs DID write data into the server regions, but that data is now 
corrupted and unreadable as ‘Unknown pdx types’.
That is a major issue, even though we acknowledge that we would NOT be deleting 
active disk stores from running clusters in production, assuming we can solve 
the rolling updates problem.

We are now aware of this CLIENT-SIDE error state and can see how it might be 
related to our redeployment use case above, but we now have corrupted 
SERVER-SIDE data written in the server regions:
https://discuss.pivotal.io/hc/en-us/articles/206357497-IllegalStateException-Unknown-PDX-Type-on-Client-Side


-Roger


