On 05/23/2017 01:10 PM, Octave J. Orgeron wrote:
Comments below...
On 5/21/2017 1:38 PM, Monty Taylor wrote:
For example: an HA strategy using slave promotion and a VIP that
points at the current write master, paired with an application
incorrectly configured for that setup, can lead to writes to the
wrong host after a failover event, and to an application that seems to
be running fine until the data turns up weird after a while.
This is definitely a more complicated area that becomes more and more
specific to the clustering technology being used. Galera vs. MySQL
Cluster is a good example. Galera has an active/passive architecture
where the above issues become a concern for sure.
This is not my understanding; Galera is multi-master, and if you lose a
node, you don't lose any committed transactions; the writesets are
validated as acceptable by, and pushed out to, all nodes before your
commit succeeds. There's an option to make it wait until all those
writesets are fully written to disk as well, but even with that option
flipped off, if you COMMIT to one node and then that node explodes, you
lose nothing: your writesets have already been verified as acceptable to
all the other nodes.
Active/active is the second bullet point on the main homepage:
http://galeracluster.com/products/
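To illustrate the ordering being described, here is a toy model of certification-based commit: the writeset is accepted by every node before COMMIT returns, so losing the originating node afterwards loses no committed data. This is only a sketch of the sequencing, not of the real wsrep protocol; the class and function names are made up for illustration.

```python
# Toy model: COMMIT succeeds only after every node has certified the
# writeset, so a node failure after COMMIT loses nothing.

class Node:
    def __init__(self, name):
        self.name = name
        self.applied = []

    def certify(self, writeset):
        # Real certification checks the writeset for conflicts against
        # in-flight transactions; here we simply accept and record it.
        self.applied.append(writeset)
        return True

def commit(writeset, nodes):
    # The writeset is pushed to all nodes before the commit succeeds.
    if all(n.certify(writeset) for n in nodes):
        return "committed"
    return "rolled back"

cluster = [Node("a"), Node("b"), Node("c")]
print(commit({"id": 1}, cluster))  # "committed"
# If node "a" now explodes, nodes "b" and "c" still hold the writeset.
cluster.pop(0)
print(all({"id": 1} in n.applied for n in cluster))  # True
```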
In the "active" approach, we still document expectations, but we also
validate them. If they are not what we expect but can be changed at
runtime, we change them, overriding any conflicting environmental
config; and if we can't, we hard-stop, indicating an unsuitable
environment.
Rather than providing helper tools, we perform the steps needed
ourselves, in the order they need to be performed, ensuring that they
are done in the manner in which they need to be done.
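The validate-override-or-hard-stop logic described above could be sketched roughly as follows. The particular variable names, expected values, and the set of variables treated as dynamic are illustrative assumptions, not a real list for any MySQL version.

```python
# Minimal sketch of the "active" approach: validate settings against
# documented expectations, override what can be changed at runtime,
# hard-stop on anything that can't be.

EXPECTED = {"sql_mode": "TRADITIONAL", "innodb_file_per_table": "ON"}
DYNAMIC = {"sql_mode"}  # assumed settable at runtime via SET GLOBAL

def validate_environment(current, expected=EXPECTED, dynamic=DYNAMIC):
    """Return SET statements to issue, or raise on a hard stop."""
    overrides = []
    for name, want in expected.items():
        have = current.get(name)
        if have == want:
            continue
        if name in dynamic:
            # Override the conflicting environmental config at runtime.
            overrides.append("SET GLOBAL %s = '%s'" % (name, want))
        else:
            # Static variable: refuse to run in an unsuitable environment.
            raise RuntimeError(
                "unsuitable environment: %s is %r, need %r"
                % (name, have, want))
    return overrides

# sql_mode is wrong but dynamic, so we get an override statement back.
print(validate_environment(
    {"sql_mode": "ANSI", "innodb_file_per_table": "ON"}))
```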
This might be a trickier situation, especially if the database(s) are in
a separate or dedicated environment that the OpenStack service processes
don't have access to. Of course for SQL commands, this isn't a problem.
But changing the configuration files and restarting the database may be
a harder thing to expect.
Nevertheless, the HA setup within TripleO does do this, currently using
Pacemaker and resource agents. This is within the scope of at least
parts of OpenStack.
In either approach the OpenStack service has to be able to talk to
both old and new versions of the schema. And in either approach we
need to make sure to limit the schema change operations to the set
that can be accomplished in an online fashion. We also have to be
careful to not start writing values to new columns until all of the
nodes have been updated, because the replication stream can't
replicate the new column value to nodes that don't have the new column.
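The "don't write the new column until every node is upgraded" rule could be gated in code along these lines. The schema-version registry and the column names here are hypothetical, purely to show the shape of the check.

```python
# Sketch: withhold values for a newly added column until every node in
# the cluster runs a schema version that knows about that column, so
# the replication stream never carries a value a node can't apply.

NEW_COLUMN_MIN_VERSION = 2  # hypothetical schema version adding the column

def build_insert(row, node_versions):
    """Build the column dict for an INSERT, gated on cluster versions."""
    all_upgraded = min(node_versions) >= NEW_COLUMN_MIN_VERSION
    columns = {"id": row["id"], "name": row["name"]}
    if all_upgraded:
        # Safe: every node has the column, so replication can deliver it.
        columns["new_col"] = row.get("new_col")
    return columns

# One node still on version 1: the new column is withheld.
print(build_insert({"id": 1, "name": "x", "new_col": "y"}, [1, 2, 2]))
# All nodes on version 2: the new column is written.
print(build_insert({"id": 1, "name": "x", "new_col": "y"}, [2, 2, 2]))
```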
This is another area where something like MySQL Cluster (NDB) would
operate differently because of its active/active architecture. So
limiting the number of online changes while a table is locked across the
cluster would be very important. There are also the application timeouts
to consider, something that could again be abstracted by oslo.db.
So the DDL we do on Galera, to confirm but also clarify Monty's point,
runs under "total order isolation", which means it holds up the whole
cluster while the DDL is applied to all nodes. Monty says this
disqualifies it as an "online upgrade", because if you emitted DDL that
had to write default values into a million rows, your whole cluster
would temporarily have to wait for that to happen; we handle that by
making sure we don't do migrations with that kind of data requirement,
and while yes, the DB has to wait for a schema change to apply, the
waits are at least very short (in theory). For practical purposes, it
is *mostly* an "online" style of migration, because all the services
that talk to the database can keep on talking to it without being
stopped, upgraded to a new software version, and restarted, which IMO
is what's really hard about "online" upgrades. It does mean that
services will see a little more latency while operations proceed.
Maybe we need a new term called "quasi-online" or something like that.
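Screening migrations for that "short under total order isolation" property might look something like the sketch below. The pattern list is a deliberately simplified assumption; which ALTERs actually rewrite a table varies by MySQL version and storage engine, so a real implementation would need a much more careful classification.

```python
# Rough illustration: allow DDL expected to apply quickly under total
# order isolation, reject DDL that would rewrite large amounts of data
# while the whole cluster waits. The patterns are illustrative only.

import re

SLOW_PATTERNS = [
    # May backfill a default into every existing row.
    r"ALTER TABLE .* ADD COLUMN .* NOT NULL DEFAULT",
    # May rewrite the whole table to change a column type.
    r"ALTER TABLE .* MODIFY ",
]

def toi_safe(ddl):
    """Return True if the DDL looks safe to run under TOI."""
    return not any(
        re.search(p, ddl, re.IGNORECASE) for p in SLOW_PATTERNS)

print(toi_safe("ALTER TABLE instances ADD COLUMN note TEXT"))       # True
print(toi_safe(
    "ALTER TABLE instances ADD COLUMN flag INT NOT NULL DEFAULT 0"))  # False
```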
Facebook has released a Python version of their "online" schema
migration tool for MySQL, which takes the full-blown "create a new,
blank table" approach: it builds a second table containing the newer
version of the schema, so that nothing at all stops or slows down. To
manage between the two tables while everything is running, it also
makes a "change capture" table to keep track of what's going on, and
then wires it all together using...triggers!
https://github.com/facebookincubator/OnlineSchemaChange/wiki/How-OSC-works.
Crazy Facebook kids. As for how we'd know that "make two more tables
and wire it all together with new triggers" is in fact more performant
than just "add a column to the table", I'm not sure how or when they
make that determination. I don't see an OpenStack cluster as quite the
same thing as hosting a site like Facebook, so I lean towards the more
liberal interpretation of "online upgrades".
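The shadow-table-plus-triggers technique that wiki page describes can be condensed into roughly the sequence below, written out as the SQL an operator would run by hand. The table and trigger names (users_new, users_delta, users_ins) are made up for illustration, and the chunked bulk-copy and delta-replay steps are elided.

```python
# Condensed sketch of the shadow-table / change-capture approach,
# expressed as an ordered list of illustrative SQL statements.

steps = [
    # 1. New, blank table carrying the target schema.
    "CREATE TABLE users_new LIKE users",
    "ALTER TABLE users_new ADD COLUMN email VARCHAR(255)",
    # 2. Change-capture table recording rows touched during the copy.
    "CREATE TABLE users_delta (id BIGINT PRIMARY KEY)",
    # 3. Triggers on the live table feed the capture table.
    "CREATE TRIGGER users_ins AFTER INSERT ON users "
    "FOR EACH ROW INSERT IGNORE INTO users_delta VALUES (NEW.id)",
    # 4. Bulk-copy existing rows in chunks, replay the delta (elided
    #    here), then atomically swap the tables.
    "RENAME TABLE users TO users_old, users_new TO users",
]

for statement in steps:
    print(statement)
```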
* Versions
It's worth noting that behavior for schema updates and other things
changes over time with the backend database version. We set minimum
versions of other things, like libvirt and OVS - so we might also want
to set minimum versions for what we can support in the database. That
way we can know for a given release of OpenStack what DDL operations
are safe to use for a rolling upgrade and what are not. That means
detecting such a version and potentially refusing to perform an
upgrade if the version isn't acceptable. That reduces the operator's
ability to choose what version of the database software to run, but
increases our ability to be able to provide tooling and operations
that we can be confident will work.
Validating the MySQL database version is a good idea. Features do
change over time. A good example is how, in 5.7, you'll get warnings
that duplicate indexes will be dropped in a future release, which
definitely affects multiple services today.
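A minimum-version gate in the spirit of the libvirt/OVS minimums mentioned above could be sketched like this. The chosen minimum (5.7) is just an example, not a project decision, and real version strings can be messier than this parser assumes.

```python
# Sketch: parse the server version string and refuse to perform a
# schema upgrade below a supported minimum.

MIN_MYSQL = (5, 7, 0)  # example minimum, not a real project policy

def parse_version(version_string):
    """Turn e.g. '5.7.17-log' into the tuple (5, 7, 17)."""
    numeric = version_string.split("-")[0]
    return tuple(int(part) for part in numeric.split("."))

def check_minimum(version_string, minimum=MIN_MYSQL):
    """Return the parsed version, or raise if below the minimum."""
    found = parse_version(version_string)
    if found < minimum:
        raise RuntimeError(
            "MySQL %s is below the supported minimum %s; refusing to "
            "perform schema upgrade"
            % (version_string, ".".join(map(str, minimum))))
    return found

print(check_minimum("5.7.17-log"))  # (5, 7, 17)
```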
== Summary ==
These are just a couple of examples - but I hope they're at least
mildly useful to explain some of the sorts of issues at hand - and why
I think we need to clarify what our intent is separate from the issue
of what databases we "support".
Some operations have one and only one "right" way to be done. For
those operations if we take an 'active' approach, we can implement
them once and not make all of our deployers and distributors each
implement and run them. However, there is a cost to that. Automatic
and prescriptive behavior has a higher dev cost that is proportional
to the number of supported architectures. This then implies a need to
limit deployer architecture choices.
On the other hand, taking an 'external' approach allows us to federate
the work of supporting the different architectures to the deployers.
This means more work on the deployer's part, but also potentially a
greater amount of freedom on their part to deploy supporting services
the way they want. It means that some of the things that have been
requested of us - such as easier operation and an increase in the
number of things that can be upgraded with no-downtime - might become
prohibitively costly for us to implement.
I honestly think that both are acceptable choices we can make and that
for any given topic there are middle grounds to be found at any given
moment in time.
BUT - without a decision as to what our long-term philosophical intent
in this space is that is clear and understandable to everyone, we
cannot have successful discussions about the impact of implementation
choices, since we will not have a shared understanding of the problem
space or the solutions we're talking about.
For my part - I hear complaints that OpenStack is 'difficult' to
operate and requests for us to make it easier. This is why I have been
advocating some actions that are clearly rooted in an 'active' worldview.
Finally, this is focused on the database layer, but similar questions
arise in other places. What is our philosophy on prescriptive/active
choices on our part, coupled with automated action and ease of
operation, vs. expanded choices for the deployer at the expense of
configuration and operational complexity? For now let's see if we can
answer it for databases, and see where that gets us.
Thanks for reading.
Monty
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe:
openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev