Roger, if you are bringing down locators and servers one after the other, and making sure each restarted node has rejoined the cluster before bringing down the next, you should not see these issues...
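The restart ordering described here can be sketched as a toy loop (Python stand-ins, not Geode APIs): stop exactly one member, restart it, and block until it has rejoined before touching the next, so that with replication-factor=2 a live copy of the data is always available.

```python
# Toy model of a serial rolling restart: at most one member is ever down,
# and each member must rejoin before the next one is stopped.
# 'Cluster' is an illustrative stand-in, not a Geode API.

class Cluster:
    def __init__(self, members):
        self.members = list(members)
        self.up = set(members)
        self.min_up = len(members)   # worst-case availability seen so far

    def stop(self, m):
        self.up.discard(m)
        self.min_up = min(self.min_up, len(self.up))

    def start(self, m):
        self.up.add(m)

    def wait_until_joined(self, m):
        # real code would poll 'gfsh list members' or a membership listener
        return m in self.up

def rolling_restart(cluster):
    for m in cluster.members:
        cluster.stop(m)                       # bring down exactly one node
        cluster.start(m)                      # redeploy and restart it
        assert cluster.wait_until_joined(m)   # rejoin before the next node

cluster = Cluster(["server1", "server2", "server3"])
rolling_restart(cluster)
print(cluster.min_up)  # 2: one member down at a time, never two
```

With redundancy 2, the invariant that `min_up` never drops below N-1 is what keeps every region's data recoverable during the upgrade.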
What is your rolling upgrade procedure? Do you use a common disk store (from one server) to restart the other nodes?

-Anil.

On Thu, Mar 9, 2017 at 2:56 PM, Roger Vandusen <[email protected]> wrote:

> Hi Hitesh, thanks for the reply.
>
> I'll take a look at your links.
>
> Yes, we did try to revoke the disk stores manually with gfsh, but this
> isn't manageable going to production.
> I can't recall the details of the revoke outcome, but it did not solve our
> problem. I think the disk store revoked was the PDX disk store, which would
> still potentially lead to 'unknown pdx type', right?
>
> Our main concern, in our scenario, was the corruption (unknown pdx types,
> unregistered or persisted) of server-side data from the client puts.
>
> -Roger
>
> From: Hitesh Khamesra <[email protected]>
> Reply-To: "[email protected]" <[email protected]>, Hitesh Khamesra <[email protected]>
> Date: Thursday, March 9, 2017 at 3:05 PM
> To: "[email protected]" <[email protected]>, Geode <[email protected]>
> Subject: Re: Unknown Pdx Type use case found, bug or expected?
>
> Hi Roger:
>
> Sorry to hear about this. There is a system property on the client side to
> clean the pdx-registry when it disconnects from the server. You can find
> details here:
> https://discuss.pivotal.io/hc/en-us/articles/221351508-Getting-Stale-PDXType-error-on-client-after-clean-start-up-of-servers
>
> I think we should clean the pdx-registry when the client disconnects. I
> will file a ticket to track this issue.
>
> For the disk issue, here are some guidelines:
> https://discuss.pivotal.io/hc/en-us/community/posts/208792347-Region-regionA-has-potentially-stale-data-It-is-waiting-for-another-member-to-recover-the-latest-data-
>
> Did you try to revoke the disk store?
>
> Thanks.
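For readers of the archive: the KB article linked above describes a client-side JVM system property that clears the client's cached PDX type ids when it disconnects. The exact property name below is our reading of that article, so treat it as an assumption and verify it there before relying on it:

```
# Client-side JVM flag (assumption: name taken from the linked KB article).
# Clears the client's cached PDX type ids on disconnect, so a reconnecting
# client does not reuse type ids a freshly wiped server no longer knows.
-Dgemfire.ON_DISCONNECT_CLEAR_PDXTYPEIDS=true
```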
> hitesh
>
> ------------------------------
>
> From: Roger Vandusen <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Thursday, March 9, 2017 12:55 PM
> Subject: Unknown Pdx Type use case found, bug or expected?
>
> Hey Geode,
>
> We have a 3-node server cluster running with PDX read-serialized and disk
> store persistence for all regions, and replication-factor=2.
>
> We do not use cluster configuration; we use these property overrides:
>
> #configuration settings used
> enable-cluster-configuration=false
> use-cluster-configuration=false
> cache-xml-file=geode-cache.xml
> #property default overrides
> distributed-system-id=1
> log-level=config
> enforce-unique-host=true
> locator-wait-time=60
> conserve-sockets=false
> log-file-size-limit=64
> mcast-port=0
>
> We use these stop/start scripts:
>
> STOP:
>
> gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
>   -e "stop server --name=$SERVER_NAME"
>
> gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
>   -e "stop locator --name=$LOCATOR_NAME"
>
> START:
>
> gfsh start locator \
>   --properties-file=$CONF_DIR/geode.properties \
>   --name=$LOCATOR_NAME \
>   --port=$LOCATOR_PORT \
>   --log-level=config \
>   --include-system-classpath=true \
>   --classpath=$CLASSPATH \
>   --enable-cluster-configuration=false \
>   --J=-Dlog4j.configurationFile=$CONF_DIR/log4j2.xml \
>   --J=-Dgemfire.jmx-manager=true \
>   --J=-Dgemfire.jmx-manager-start=true \
>   --J=-Xms512m \
>   --J=-Xmx512m
>
> gfsh start server \
>   --properties-file=$CONF_DIR/geode.properties \
>   --cache-xml-file=$CONF_DIR/geode-cache.xml \
>   --name=$SERVER_NAME \
>   --server-port=$SERVER_PORT \
>   --include-system-classpath=true \
>   --classpath=$CLASSPATH \
>   --start-rest-api=true \
>   --use-cluster-configuration=false \
>   --J=-Dlog4j.configurationFile=$CONF_DIR/log4j2.xml \
>   --J=-Dgemfire.disk.recoverValues=false \
>   --J=-Dgemfire.jmx-manager=false \
>   --J=-Dgemfire.jmx-manager-start=false \
>   --J=-Xms6g \
>   --J=-Xmx6g
>
> There were active proxy clients (1.0.0-incubating/GFE 9.0) connected while
> we proceeded to upgrade the Geode version from 1.0.0-incubating to 1.1.0.
>
> We did a scripted rolling Geode version upgrade redeployment by serially
> stopping/deploying/restarting each server node.
> We hit the issue below, which we've seen before and still find difficult
> to solve:
>
> 'Region /xxxx has potentially stale data. It is waiting for another member
> to recover the latest data.'
>
> The first node1 server hung on restart, blocking our rolling serial
> redeployment.
>
> After failing to resolve this serial rolling update problem (again), we
> decided to delete all the data (currently just cached lookup tables and
> dev WIP/POC data), redeploy the new Geode version, and restart from
> scratch; so we deleted all the disk stores (including the PDX disk store)
> and restarted the cluster.
>
> REMINDER: the clients were all still connected and not restarted!!! (See
> the link below for our awareness now of this CLIENT-SIDE error state.)
> These clients then put data into the server cluster; the puts succeeded,
> and the server regions show they have the data.
>
> BUT now a gfsh query of this server region data gives 'Unknown pdx type',
> and restarting the clients fails on connecting to these regions with the
> same error: 'Unknown pdx type'.
>
> We are seeking GEODE-USER feedback regarding:
>
> 1) We need to find a working enterprise deployment solution to resolve
> the rolling restart problem, where stale-data alerts block cluster
> config/version updates.
> 2) We don't believe the problem we saw was related to version upgrading.
> 3) We find it very concerning that connected clients can CORRUPT
> SERVER-SIDE region data and don't update the pdx registry and disk store
> upon puts.
> A FAIL of the client-side proxy region.put would make more sense?
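On the 'potentially stale data' hang during the rolling restart: gfsh can report which disk stores the cluster is waiting on, and can revoke one so that recovery stops waiting (at the cost of discarding that copy's latest writes). A sketch of such a session, to be adapted to the actual locator address and the ids the first command prints:

```
gfsh> connect --locator=localhost[10334]
gfsh> show missing-disk-stores
gfsh> revoke missing-disk-store --id=<disk-store-id-from-previous-output>
```

As the thread notes, manual revocation does not scale to production and does not help when the revoked store is the PDX disk store itself, since the type registry it held is then gone.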
> Why didn't the PDX types cached on the client get registered and written
> back to the servers' disk stores?
> The client puts DID write data into the server regions, but that data is
> now corrupted and unreadable as 'Unknown pdx types'.
> That is a major issue, even though we acknowledge that we would NOT be
> deleting active disk stores from running clusters in production, assuming
> we can solve the rolling updates problem.
>
> We are now aware of this CLIENT-SIDE error state and can see how it might
> be related to our redeployment use case above, but we now have corrupted
> SERVER-SIDE data written in server regions:
> https://discuss.pivotal.io/hc/en-us/articles/206357497-IllegalStateException-Unknown-PDX-Type-on-Client-Side
>
> -Roger
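The failure mode the thread describes can be illustrated with a toy model (Python, all names illustrative rather than Geode APIs): the client caches which PDX types the server already knows, so after the server's PDX registry is wiped the client keeps tagging puts with type ids the server no longer has, and the stored values become unreadable.

```python
# Toy model of 'Unknown PDX Type' after deleting the server's PDX disk store
# while clients stay connected. Not Geode code; a sketch of the mechanism.

server_registry = {}    # type_id -> type metadata (the server's PDX registry)
client_known_ids = {}   # type name -> type_id cached on the client

def client_put(region, key, type_name, fields):
    # The client registers a type once, then trusts its local cache.
    if type_name not in client_known_ids:
        type_id = len(server_registry) + 1
        server_registry[type_id] = type_name
        client_known_ids[type_name] = type_id
    region[key] = (client_known_ids[type_name], fields)  # blob tagged with id

def server_read(region, key):
    type_id, fields = region[key]
    if type_id not in server_registry:
        raise ValueError("Unknown PDX Type: %d" % type_id)
    return fields

region = {}
client_put(region, "k1", "Customer", {"name": "a"})
server_registry.clear()   # "deleted all the disk stores, including PDX"
client_put(region, "k2", "Customer", {"name": "b"})  # the put still succeeds
try:
    server_read(region, "k2")   # ...but the stored value is now unreadable
except ValueError as e:
    print(e)   # Unknown PDX Type: 1
```

This is why the put succeeds while every later read fails: the write path only ships a type id, and nothing on that path re-checks that the server's registry still contains it.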
