Roger, if you are bringing down locators and servers one after the other, and making sure each restarted node has rejoined the cluster before bringing down the next, you should not see these issues...
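The restart ordering described here can be sketched as a toy loop (Python stand-ins, not Geode APIs): stop exactly one member, restart it, and block until it has rejoined before touching the next, so that with replication-factor=2 a live copy of the data is always available.

```python
# Toy model of a serial rolling restart: at most one member is ever down,
# and each member must rejoin before the next one is stopped.
# 'Cluster' is an illustrative stand-in, not a Geode API.

class Cluster:
    def __init__(self, members):
        self.members = list(members)
        self.up = set(members)
        self.min_up = len(members)   # worst-case availability seen so far

    def stop(self, m):
        self.up.discard(m)
        self.min_up = min(self.min_up, len(self.up))

    def start(self, m):
        self.up.add(m)

    def wait_until_joined(self, m):
        # real code would poll 'gfsh list members' or a membership listener
        return m in self.up

def rolling_restart(cluster):
    for m in cluster.members:
        cluster.stop(m)                       # bring down exactly one node
        cluster.start(m)                      # redeploy and restart it
        assert cluster.wait_until_joined(m)   # rejoin before the next node

cluster = Cluster(["server1", "server2", "server3"])
rolling_restart(cluster)
print(cluster.min_up)  # 2: one member down at a time, never two
```

With redundancy 2, the invariant that `min_up` never drops below N-1 is what keeps every region's data recoverable during the upgrade.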
What is your rolling upgrade procedure? Do you use a common disk store (from one server) to restart the other nodes?

-Anil.

On Thu, Mar 9, 2017 at 2:56 PM, Roger Vandusen <[email protected]> wrote:

> Hi Hitesh, thanks for the reply.
>
> I'll take a look at your links.
>
> Yes, we did try to revoke the disk stores manually with gfsh, but this
> isn't manageable going to production.
> I can't recall the details of the revoke outcome, but it did not solve our
> problem. I think the disk store revoked was the PDX disk store, which would
> still potentially lead to 'unknown pdx type', right?
>
> Our main concern, in our scenario, was the corruption (unknown pdx types,
> unregistered or persisted) of server-side data from the client puts.
>
> -Roger
>
> From: Hitesh Khamesra <[email protected]>
> Reply-To: "[email protected]" <[email protected]>, Hitesh Khamesra <[email protected]>
> Date: Thursday, March 9, 2017 at 3:05 PM
> To: "[email protected]" <[email protected]>, Geode <[email protected]>
> Subject: Re: Unknown Pdx Type use case found, bug or expected?
>
> Hi Roger:
>
> Sorry to hear about this. There is a system property on the client side to
> clean the pdx-registry when it disconnects from the server. You can find
> details here:
> https://discuss.pivotal.io/hc/en-us/articles/221351508-Getting-Stale-PDXType-error-on-client-after-clean-start-up-of-servers
>
> I think we should clean the pdx-registry when the client disconnects. I
> will file a ticket to track this issue.
>
> For the disk issue, here are some guidelines:
> https://discuss.pivotal.io/hc/en-us/community/posts/208792347-Region-regionA-has-potentially-stale-data-It-is-waiting-for-another-member-to-recover-the-latest-data-
>
> Did you try to revoke the disk store?
>
> Thanks.
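For readers of the archive: the KB article linked above describes a client-side JVM system property that clears the client's cached PDX type ids when it disconnects. The exact property name below is our reading of that article, so treat it as an assumption and verify it there before relying on it:

```
# Client-side JVM flag (assumption: name taken from the linked KB article).
# Clears the client's cached PDX type ids on disconnect, so a reconnecting
# client does not reuse type ids a freshly wiped server no longer knows.
-Dgemfire.ON_DISCONNECT_CLEAR_PDXTYPEIDS=true
```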
> hitesh
>
> ------------------------------
>
> From: Roger Vandusen <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Thursday, March 9, 2017 12:55 PM
> Subject: Unknown Pdx Type use case found, bug or expected?
>
> Hey Geode,
>
> We have a 3-node server cluster running with PDX read-serialized and disk
> store persistence for all regions, and replication-factor=2.
>
> We do not use cluster configuration; we use these property overrides:
>
> #configuration settings used
> enable-cluster-configuration=false
> use-cluster-configuration=false
> cache-xml-file=geode-cache.xml
> #property default overrides
> distributed-system-id=1
> log-level=config
> enforce-unique-host=true
> locator-wait-time=60
> conserve-sockets=false
> log-file-size-limit=64
> mcast-port=0
>
> We use these stop/start scripts:
>
> STOP:
>
> gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
>   -e "stop server --name=$SERVER_NAME"
>
> gfsh -e "connect --locator=$HOSTNAME[$LOCATOR_PORT]" \
>   -e "stop locator --name=$LOCATOR_NAME"
>
> START:
>
> gfsh start locator \
>   --properties-file=$CONF_DIR/geode.properties \
>   --name=$LOCATOR_NAME \
>   --port=$LOCATOR_PORT \
>   --log-level=config \
>   --include-system-classpath=true \
>   --classpath=$CLASSPATH \
>   --enable-cluster-configuration=false \
>   --J=-Dlog4j.configurationFile=$CONF_DIR/log4j2.xml \
>   --J=-Dgemfire.jmx-manager=true \
>   --J=-Dgemfire.jmx-manager-start=true \
>   --J=-Xms512m \
>   --J=-Xmx512m
>
> gfsh start server \
>   --properties-file=$CONF_DIR/geode.properties \
>   --cache-xml-file=$CONF_DIR/geode-cache.xml \
>   --name=$SERVER_NAME \
>   --server-port=$SERVER_PORT \
>   --include-system-classpath=true \
>   --classpath=$CLASSPATH \
>   --start-rest-api=true \
>   --use-cluster-configuration=false \
>   --J=-Dlog4j.configurationFile=$CONF_DIR/log4j2.xml \
>   --J=-Dgemfire.disk.recoverValues=false \
>   --J=-Dgemfire.jmx-manager=false \
>   --J=-Dgemfire.jmx-manager-start=false \
>   --J=-Xms6g \
>   --J=-Xmx6g
>
> There were active proxy clients (1.0.0-incubating/GFE 9.0) connected while
> we proceeded to upgrade the Geode version from 1.0.0-incubating to 1.1.0.
>
> We did a scripted rolling Geode version upgrade redeployment by serially
> stopping/deploying/restarting each server node.
> We hit the issue below, which we've seen before and still find difficult
> to solve:
>
> 'Region /xxxx has potentially stale data. It is waiting for another member
> to recover the latest data.'
>
> The first node1 server hung on restart, blocking our rolling serial
> redeployment.
>
> After failing to resolve this serial rolling update problem (again), we
> decided to delete all the data (currently just cached lookup tables and
> dev WIP/POC data), redeploy the new Geode version, and restart from
> scratch; so we deleted all the disk stores (including the PDX disk store)
> and restarted the cluster.
>
> REMINDER: the clients were all still connected and not restarted!!! (See
> the link below for our awareness now of this CLIENT-SIDE error state.)
> These clients then put data into the server cluster; the puts succeeded,
> and the server regions show they have the data.
>
> BUT now a gfsh query of this server region data gives 'Unknown pdx type',
> and restarting the clients fails on connecting to these regions with the
> same error: 'Unknown pdx type'.
>
> We are seeking GEODE-USER feedback regarding:
>
> 1) We need to find a working enterprise deployment solution to resolve
> the rolling restart problem, where stale-data alerts block cluster
> config/version updates.
> 2) We don't believe the problem we saw was related to version upgrading.
> 3) We find it very concerning that connected clients can CORRUPT
> SERVER-SIDE region data and don't update the pdx registry and disk store
> upon puts.
> A FAIL of the client-side proxy region.put would make more sense?
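On the 'potentially stale data' hang during the rolling restart: gfsh can report which disk stores the cluster is waiting on, and can revoke one so that recovery stops waiting (at the cost of discarding that copy's latest writes). A sketch of such a session, to be adapted to the actual locator address and the ids the first command prints:

```
gfsh> connect --locator=localhost[10334]
gfsh> show missing-disk-stores
gfsh> revoke missing-disk-store --id=<disk-store-id-from-previous-output>
```

As the thread notes, manual revocation does not scale to production and does not help when the revoked store is the PDX disk store itself, since the type registry it held is then gone.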
> Why didn't the PDX types cached on the client get registered and written
> back to the servers' disk stores?
> The client puts DID write data into the server regions, but that data is
> now corrupted and unreadable as 'Unknown pdx types'.
> That is a major issue, even though we acknowledge that we would NOT be
> deleting active disk stores from running clusters in production, assuming
> we can solve the rolling updates problem.
>
> We are now aware of this CLIENT-SIDE error state and can see how it might
> be related to our redeployment use case above, but we now have corrupted
> SERVER-SIDE data written in server regions:
> https://discuss.pivotal.io/hc/en-us/articles/206357497-IllegalStateException-Unknown-PDX-Type-on-Client-Side
>
> -Roger
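The failure mode the thread describes can be illustrated with a toy model (Python, all names illustrative rather than Geode APIs): the client caches which PDX types the server already knows, so after the server's PDX registry is wiped the client keeps tagging puts with type ids the server no longer has, and the stored values become unreadable.

```python
# Toy model of 'Unknown PDX Type' after deleting the server's PDX disk store
# while clients stay connected. Not Geode code; a sketch of the mechanism.

server_registry = {}    # type_id -> type metadata (the server's PDX registry)
client_known_ids = {}   # type name -> type_id cached on the client

def client_put(region, key, type_name, fields):
    # The client registers a type once, then trusts its local cache.
    if type_name not in client_known_ids:
        type_id = len(server_registry) + 1
        server_registry[type_id] = type_name
        client_known_ids[type_name] = type_id
    region[key] = (client_known_ids[type_name], fields)  # blob tagged with id

def server_read(region, key):
    type_id, fields = region[key]
    if type_id not in server_registry:
        raise ValueError("Unknown PDX Type: %d" % type_id)
    return fields

region = {}
client_put(region, "k1", "Customer", {"name": "a"})
server_registry.clear()   # "deleted all the disk stores, including PDX"
client_put(region, "k2", "Customer", {"name": "b"})  # the put still succeeds
try:
    server_read(region, "k2")   # ...but the stored value is now unreadable
except ValueError as e:
    print(e)   # Unknown PDX Type: 1
```

This is why the put succeeds while every later read fails: the write path only ships a type id, and nothing on that path re-checks that the server's registry still contains it.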
