Alexey,

Making only one node write metadata to disk synchronously is a possible and 
easy-to-implement solution, but it still has a few drawbacks:

• Discovery will still be blocked on one node. This is better than blocking all 
nodes one by one, but a disk write may take an indefinitely long time, so 
discovery may still be affected.
• There is an unlikely but unpleasant case:
    1. The coordinator writes metadata synchronously to disk and finalizes the 
metadata registration. Other nodes do it asynchronously, so the actual fsync to 
disk may be delayed.
    2. A transaction is committed.
    3. The cluster is shut down before all nodes finish their fsync of metadata.
    4. Nodes are started again one by one.
    5. Before the previous coordinator is started again, a read operation tries 
to read data that uses the metadata that wasn’t fsynced anywhere except on 
the coordinator, which is still not started.
    6. An error about unknown metadata is generated.

In the scheme that Sergey and I proposed, this situation isn’t possible, 
since the data won’t be written to disk until the metadata fsync is finished. 
Every mapped node will wait on a future until metadata is written to disk 
before performing any cache changes.
What do you think about such a fix?

Denis
On 22 Aug 2019, 12:44 +0300, Alexei Scherbakov <alexey.scherbak...@gmail.com> 
wrote:
> Denis Mekhanikov,
>
> I think at least one node (coordinator for example) still should write
> metadata synchronously to protect from a scenario:
>
> tx creating new metadata is committed <- all nodes in grid are failed
> (powered off) <- async writing to disk is completed
>
> where <- means "happens before"
>
> All other nodes could write asynchronously, by using a separate thread or not
> doing fsync (same effect)
>
>
>
> Wed, 21 Aug 2019 at 19:48, Denis Mekhanikov <dmekhani...@gmail.com>:
>
> > Alexey,
> >
> > I’m not suggesting duplicating anything.
> > My point is that the proper fix will be implemented in a relatively
> > distant future. Why not improve the existing mechanism now instead of
> > waiting for the proper fix?
> > If we don’t agree on doing this fix in master, I can do it in a fork and
> > use it in my setup. So please let me know if you see any other drawbacks in
> > the proposed solution.
> >
> > Denis
> >
> > > On 21 Aug 2019, at 15:53, Alexei Scherbakov <
> > alexey.scherbak...@gmail.com> wrote:
> > >
> > > Denis Mekhanikov,
> > >
> > > If we are still talking about a "proper" solution, the metastore (I meant,
> > > of course, the distributed one) is the way to go.
> > >
> > > It has a contract to store cluster-wide metadata in the most efficient way
> > > and can have any optimization for concurrent writing inside.
> > >
> > > I'm against creating a duplicating mechanism as you suggested. We do not
> > > need more copy/paste code.
> > >
> > > Another possibility is to carry metadata along with the appropriate request
> > > if it's not found locally, but this is a rather big modification.
> > >
> > >
> > >
> > > Tue, 20 Aug 2019 at 17:26, Denis Mekhanikov <dmekhani...@gmail.com>:
> > >
> > > > Eduard,
> > > >
> > > > Usages will wait for the metadata to be registered and written to disk.
> > > > No races should occur with such a flow.
> > > > Or do you have some specific case in mind?
> > > >
> > > > I agree that using a distributed metastorage would be nice here.
> > > > But this way we will kind of move back to the previous scheme with a
> > > > replicated system cache, where metadata was stored before.
> > > > Will the scheme with the metastorage be different in any way? Won’t we
> > > > decide to move back to discovery messages again after a while?
> > > >
> > > > Denis
> > > >
> > > >
> > > > > On 20 Aug 2019, at 15:13, Eduard Shangareev <
> > eduard.shangar...@gmail.com>
> > > > wrote:
> > > > >
> > > > > Denis,
> > > > > How would we deal with races between registration and metadata usages
> > > > > with such a fast fix?
> > > > >
> > > > > I believe that we need to move it to the distributed metastorage, and
> > > > > await registration completeness if we can't find it (wait for work in
> > > > > progress).
> > > > > Discovery shouldn't wait for anything here.
> > > > >
> > > > > On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <
> > dmekhani...@gmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > Sergey,
> > > > > >
> > > > > > Currently metadata is written to disk sequentially on every node. 
> > > > > > Only
> > > > one
> > > > > > node at a time is able to write metadata to its storage.
> > > > > > Slowness accumulates when you add more nodes. The delay required to
> > > > > > write one piece of metadata may not be that big, but if you multiply
> > > > > > it by, say, 200, then it becomes noticeable.
> > > > > > But if we move the writing out of discovery threads, then nodes will
> > > > > > be doing it in parallel.
> > > > > >
> > > > > > I think it’s better to block some threads from a striped pool for a
> > > > > > little while rather than blocking discovery for the same period
> > > > > > multiplied by the number of nodes.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Denis
> > > > > >
> > > > > > > On 15 Aug 2019, at 10:26, Sergey Chugunov 
> > > > > > > <sergey.chugu...@gmail.com
> > >
> > > > > > wrote:
> > > > > > >
> > > > > > > Denis,
> > > > > > >
> > > > > > > Thanks for bringing this issue up; the decision to write binary
> > > > > > > metadata from the discovery thread was really a tough one to make.
> > > > > > > I don't think that moving metadata to the metastorage is a silver
> > > > > > > bullet here, as this approach also has its drawbacks and is not an
> > > > > > > easy change.
> > > > > > >
> > > > > > > In addition to workarounds suggested by Alexei we have two 
> > > > > > > choices to
> > > > > > > offload write operation from discovery thread:
> > > > > > >
> > > > > > > 1. Your scheme with a separate writer thread and futures completed
> > > > when
> > > > > > > write operation is finished.
> > > > > > > 2. PME-like protocol with obvious complications like failover and
> > > > > > > asynchronous wait for replies over communication layer.
> > > > > > >
> > > > > > > Your suggestion looks easier from a code complexity perspective, but
> > > > > > > in my view it increases the chances of starvation. Now, if some node
> > > > > > > faces really long delays during a write op, it is going to be kicked
> > > > > > > out of the topology by the discovery protocol. In your case it is
> > > > > > > possible that more and more threads from other pools may get stuck
> > > > > > > waiting on the operation future, which is also not good.
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > I also think that if we want to approach this issue systematically,
> > > > > > > we need to do a deep analysis of the metastorage option as well and
> > > > > > > finally choose which road we want to go down.
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky
> > > > > > > <arzamas...@mail.ru.invalid> wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 1. Yes, only on OS failures. In such a case data will be
> > > > > > > > > > received from the alive nodes later.
> > > > > > > > What would the behavior be in the case of a single node? I suppose
> > > > > > > > someone could obtain cache data without the unmarshalling schema;
> > > > > > > > what would happen to grid operability in this case?
> > > > > > > >
> > > > > > > > >
> > > > > > > > > > 2. Yes, for walmode=FSYNC writes to the metastore will be slow.
> > > > > > > > > > But such a mode should not be used if you have more than two
> > > > > > > > > > nodes in the grid because it has a huge impact on performance.
> > > > > > > > Does WAL mode affect the metadata store?
> > > > > > > >
> > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhani...@gmail.com>:
> > > > > > > > > >
> > > > > > > > > > > Folks,
> > > > > > > > > > >
> > > > > > > > > > > Thanks for showing interest in this issue!
> > > > > > > > > > >
> > > > > > > > > > > Alexey,
> > > > > > > > > > >
> > > > > > > > > > > > I think removing fsync could help to mitigate performance
> > > > > > > > > > > > issues with the current implementation
> > > > > > > > > > >
> > > > > > > > > > > Is my understanding correct that if we remove fsync, then
> > > > > > > > > > > discovery won’t be blocked, data will be flushed to disk in
> > > > > > > > > > > the background, and loss of information will be possible only
> > > > > > > > > > > on an OS failure? It sounds like an acceptable workaround to me.
> > > > > > > > > > >
> > > > > > > > > > > Will moving metadata to the metastore actually resolve this
> > > > > > > > > > > issue? Please correct me if I’m wrong, but we will still need
> > > > > > > > > > > to write the information to the WAL before releasing the
> > > > > > > > > > > discovery thread. If the WAL mode is FSYNC, then the issue
> > > > > > > > > > > will still be there. Or is it planned to abandon the
> > > > > > > > > > > discovery-based protocol altogether?
> > > > > > > > > > >
> > > > > > > > > > > Evgeniy, Ivan,
> > > > > > > > > > >
> > > > > > > > > > > In my particular case the data wasn’t too big. It was a slow
> > > > > > > > > > > virtualized disk with encryption that made operations slow.
> > > > > > > > > > > Given that there are 200 nodes in the cluster, where every
> > > > > > > > > > > node writes slowly, and this process is sequential, one piece
> > > > > > > > > > > of metadata is registered extremely slowly.
> > > > > > > > > > >
> > > > > > > > > > > Ivan, answering your other questions:
> > > > > > > > > > >
> > > > > > > > > > > > 2. Do we need a persistent metadata for in-memory 
> > > > > > > > > > > > caches? Or is
> > it
> > > > > > so
> > > > > > > > > > > accidentally?
> > > > > > > > > > >
> > > > > > > > > > > It should be checked whether it’s safe to stop writing
> > > > > > > > > > > marshaller mappings to disk without losing any guarantees.
> > > > > > > > > > > But anyway, I would like to have a property that would
> > > > > > > > > > > control this. If metadata registration is slow, then the
> > > > > > > > > > > initial cluster warmup may take a while. So, if we preserve
> > > > > > > > > > > metadata on disk, then we will need to warm it up only once,
> > > > > > > > > > > and further restarts won’t be affected.
> > > > > > > > > > >
> > > > > > > > > > > > Do we really need a fast fix here?
> > > > > > > > > > >
> > > > > > > > > > > I would like a fix that could be implemented now, since the
> > > > > > > > > > > activity of moving metadata to the metastore doesn’t sound
> > > > > > > > > > > like a quick one. Having a temporary solution would be nice.
> > > > > > > > > > >
> > > > > > > > > > > Denis
> > > > > > > > > > >
> > > > > > > > > > > > On 14 Aug 2019, at 11:53, Павлухин Иван < 
> > > > > > > > > > > > vololo...@gmail.com >
> > > > > > > > wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > Denis,
> > > > > > > > > > > >
> > > > > > > > > > > > Several clarifying questions:
> > > > > > > > > > > > 1. Do you have an idea why metadata registration takes so
> > > > > > > > > > > > long? Poor disks? Too much data to write? Contention with
> > > > > > > > > > > > disk writes by other subsystems?
> > > > > > > > > > > > 2. Do we need a persistent metadata for in-memory 
> > > > > > > > > > > > caches? Or is
> > it
> > > > > > so
> > > > > > > > > > > > accidentally?
> > > > > > > > > > > >
> > > > > > > > > > > > Generally, I think that it is possible to move metadata-saving
> > > > > > > > > > > > operations out of the discovery thread without losing the
> > > > > > > > > > > > required consistency/integrity.
> > > > > > > > > > > >
> > > > > > > > > > > > As Alex mentioned, using the metastore looks like a better
> > > > > > > > > > > > solution. Do we really need a fast fix here? (Are we talking
> > > > > > > > > > > > about a fast fix?)
> > > > > > > > > > > >
> > > > > > > > > > > > Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky
> > > > > > > > > > > > < arzamas...@mail.ru.invalid >:
> > > > > > > > > > > > >
> > > > > > > > > > > > > Alexey, but in this case the customer needs to be informed
> > > > > > > > > > > > > that a whole cluster crash (power off; for example, of a
> > > > > > > > > > > > > 1-node cluster) could lead to partial data unavailability.
> > > > > > > > > > > > > And maybe further index corruption.
> > > > > > > > > > > > > 1. Why does your meta take up a substantial size? Maybe a
> > > > > > > > > > > > > context leak?
> > > > > > > > > > > > > 2. Could the meta be compressed?
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Wednesday, 14 August 2019, 11:22 +03:00 from Alexei
> > > > > > > > > > > > > > Scherbakov < alexey.scherbak...@gmail.com >:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Denis Mekhanikov,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Currently metadata is fsync'ed on write. This might be
> > > > > > > > > > > > > > the cause of slow-downs in the case of metadata burst
> > > > > > > > > > > > > > writes.
> > > > > > > > > > > > > > I think removing fsync could help to mitigate performance
> > > > > > > > > > > > > > issues with the current implementation until the proper
> > > > > > > > > > > > > > solution is implemented: moving metadata to the metastore.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <
> > > > > > > > > > > > > > dmekhani...@gmail.com>:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I would also like to mention that marshaller mappings
> > > > > > > > > > > > > > > are written to disk even if persistence is disabled.
> > > > > > > > > > > > > > > So, this issue affects purely in-memory clusters as well.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On 13 Aug 2019, at 17:06, Denis Mekhanikov <
> > > > > > > > dmekhani...@gmail.com >
> > > > > > > > > > > > > > > wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Hi!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > When persistence is enabled, binary metadata is
> > > > > > > > > > > > > > > > written to disk upon registration. Currently this
> > > > > > > > > > > > > > > > happens in the discovery thread, which makes
> > > > > > > > > > > > > > > > processing of related messages very slow.
> > > > > > > > > > > > > > > > There are cases when a lot of nodes and slow disks
> > > > > > > > > > > > > > > > can make the registration of every binary type take
> > > > > > > > > > > > > > > > several minutes. Plus, it blocks processing of other
> > > > > > > > > > > > > > > > messages.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I propose starting a separate thread that will be
> > > > > > > > > > > > > > > > responsible for writing binary metadata to disk. So,
> > > > > > > > > > > > > > > > binary type registration will be considered finished
> > > > > > > > > > > > > > > > before information about it is written to disk on
> > > > > > > > > > > > > > > > all nodes.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The main concern here is data consistency in 
> > > > > > > > > > > > > > > > cases when a
> > node
> > > > > > > > > > > > > > > acknowledges type registration and then fails 
> > > > > > > > > > > > > > > before writing
> > > > the
> > > > > > > > > > > metadata
> > > > > > > > > > > > > > > to disk.
> > > > > > > > > > > > > > > > I see two parts of this issue:
> > > > > > > > > > > > > > > > 1. Nodes will have different metadata after restarting.
> > > > > > > > > > > > > > > > 2. If we write some data into a persisted cache and
> > > > > > > > > > > > > > > > shut down nodes faster than a new binary type is
> > > > > > > > > > > > > > > > written to disk, then after a restart we won’t have
> > > > > > > > > > > > > > > > a binary type to work with.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The first case is similar to a situation when one
> > > > > > > > > > > > > > > > node fails, and after that a new type is registered
> > > > > > > > > > > > > > > > in the cluster. This issue is resolved by the
> > > > > > > > > > > > > > > > discovery data exchange. All nodes receive information
> > > > > > > > > > > > > > > > about all binary types in the initial discovery
> > > > > > > > > > > > > > > > messages sent by other nodes. So, once you restart a
> > > > > > > > > > > > > > > > node, it will receive the information that it failed
> > > > > > > > > > > > > > > > to finish writing to disk from the other nodes.
> > > > > > > > > > > > > > > > If all nodes shut down before finishing writing the
> > > > > > > > > > > > > > > > metadata to disk, then after a restart the type will
> > > > > > > > > > > > > > > > be considered unregistered, so another registration
> > > > > > > > > > > > > > > > will be required.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The second case is a bit more complicated. But it
> > > > > > > > > > > > > > > > can be resolved by making the discovery threads on
> > > > > > > > > > > > > > > > every node create a future that will be completed
> > > > > > > > > > > > > > > > when writing to disk is finished. So, every node will
> > > > > > > > > > > > > > > > have such a future that reflects the current state of
> > > > > > > > > > > > > > > > persisting the metadata to disk.
> > > > > > > > > > > > > > > > After that, if some operation needs this binary type,
> > > > > > > > > > > > > > > > it will need to wait on that future until flushing to
> > > > > > > > > > > > > > > > disk is finished.
> > > > > > > > > > > > > > > > This way discovery threads won’t be blocked, but
> > > > > > > > > > > > > > > > other threads that actually need this type will be.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Please let me know what you think about that.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Denis
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > Alexei Scherbakov
> > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > --
> > > > > > > > > > > > > Zhenya Stanilovsky
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Ivan Pavlukhin
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > >
> > > > > > > > > > Best regards,
> > > > > > > > > > Alexei Scherbakov
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Zhenya Stanilovsky
> > > > > > > >
> > > > > >
> > > > > >
> > > >
> > > >
> > >
> > > --
> > >
> > > Best regards,
> > > Alexei Scherbakov
> >
> >
>
> --
>
> Best regards,
> Alexei Scherbakov
