Alexey, I'm not suggesting that we duplicate anything. My point is that the proper fix will only arrive in the relatively distant future. Why not improve the existing mechanism now instead of waiting for the proper fix? If we don't agree on making this change in master, I can do it in a fork and use it in my setup. So please let me know if you see any other drawbacks in the proposed solution.
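
To make it concrete, here is a minimal sketch of the mechanism I have in mind (all class and method names are illustrative, not actual Ignite internals): the discovery thread only enqueues a metadata update, a dedicated writer thread persists it, and a future lets the threads that actually need the type wait for the flush. A sketch of the waiting side is at the bottom of this mail.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch: a dedicated thread persists binary metadata,
// so the discovery thread only enqueues the update and moves on.
class MetadataWriter {
    private final BlockingQueue<WriteTask> queue = new LinkedBlockingQueue<>();

    MetadataWriter() {
        Thread t = new Thread(this::writeLoop, "binary-metadata-writer");
        t.setDaemon(true);
        t.start();
    }

    // Called from the discovery thread; returns immediately.
    CompletableFuture<Void> enqueue(int typeId, byte[] meta) {
        WriteTask task = new WriteTask(typeId, meta);
        queue.add(task);
        return task.fut; // completes once the metadata is fsync'ed
    }

    private void writeLoop() {
        while (true) {
            WriteTask task;
            try {
                task = queue.take();
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            try {
                writeToDisk(task.typeId, task.meta); // fsync happens here
                task.fut.complete(null);
            }
            catch (Exception e) {
                task.fut.completeExceptionally(e); // fail the waiters, not discovery
            }
        }
    }

    private void writeToDisk(int typeId, byte[] meta) throws Exception {
        // Actual fsync'ed file write goes here; elided in this sketch.
    }

    private static class WriteTask {
        final int typeId;
        final byte[] meta;
        final CompletableFuture<Void> fut = new CompletableFuture<>();

        WriteTask(int typeId, byte[] meta) {
            this.typeId = typeId;
            this.meta = meta;
        }
    }
}

This way a slow disk delays only the operations that depend on the new type, instead of holding the discovery worker.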
Denis

> On 21 Aug 2019, at 15:53, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>
> Denis Mekhanikov,
>
> If we are still talking about a "proper" solution, the metastore (I meant, of course, the distributed one) is the way to go.
>
> It has a contract to store cluster-wide metadata in the most efficient way and can have any optimization for concurrent writing inside.
>
> I'm against creating a duplicating mechanism as you suggested. We do not need more copy/paste code.
>
> Another possibility is to carry metadata along with the appropriate request if it's not found locally, but this is a rather big modification.
>
> Tue, 20 Aug 2019 at 17:26, Denis Mekhanikov <dmekhani...@gmail.com>:
>
>> Eduard,
>>
>> Usages will wait for the metadata to be registered and written to disk.
>> No races should occur with such a flow.
>> Or do you have some specific case in mind?
>>
>> I agree that using a distributed metastorage would be nice here.
>> But this way we will, in a sense, move back to the previous scheme with a replicated system cache, where metadata was stored before.
>> Will the scheme with the metastorage be different in any way? Won't we decide to move back to discovery messages again after a while?
>>
>> Denis
>>
>>> On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangar...@gmail.com> wrote:
>>>
>>> Denis,
>>> How would we deal with races between registration and metadata usages with such a fast fix?
>>>
>>> I believe that we need to move it to the distributed metastorage and await registration completeness if we can't find it (wait for the work in progress).
>>> Discovery shouldn't wait for anything here.
>>>
>>> On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>
>>>> Sergey,
>>>>
>>>> Currently metadata is written to disk sequentially on every node. Only one node at a time is able to write metadata to its storage.
>>>> Slowness accumulates as you add more nodes. The delay required to write one piece of metadata may not be that big, but multiply it by, say, 200, and it becomes noticeable.
>>>> But if we move the writing out of the discovery threads, the nodes will do it in parallel.
>>>>
>>>> I think it's better to block some threads from a striped pool for a little while than to block discovery for the same period multiplied by the number of nodes.
>>>>
>>>> What do you think?
>>>>
>>>> Denis
>>>>
>>>>> On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:
>>>>>
>>>>> Denis,
>>>>>
>>>>> Thanks for bringing this issue up; the decision to write binary metadata from the discovery thread was really a tough one to make.
>>>>> I don't think that moving metadata to the metastorage is a silver bullet here, as that approach also has its drawbacks and is not an easy change.
>>>>>
>>>>> In addition to the workarounds suggested by Alexei, we have two options for offloading the write operation from the discovery thread:
>>>>>
>>>>> 1. Your scheme with a separate writer thread and futures completed when the write operation is finished.
>>>>> 2. A PME-like protocol, with obvious complications like failover and asynchronous waiting for replies over the communication layer.
>>>>>
>>>>> Your suggestion looks easier from a code-complexity perspective, but in my view it increases the chances of running into starvation.
>>>>> Now, if some node faces really long delays during a write operation, it is going to be kicked out of the topology by the discovery protocol.
>>>>> In your case it is possible that more and more threads from other pools will get stuck waiting on the operation future, which is also not good.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> I also think that if we want to approach this issue systematically, we need to do a deep analysis of the metastorage option as well and finally choose which road we want to take.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky <arzamas...@mail.ru.invalid> wrote:
>>>>>
>>>>>>
>>>>>>>> 1. Yes, only on OS failures. In such a case the data will be received from alive nodes later.
>>>>>>
>>>>>> What would the behavior be in the case of a single node? I suppose someone could obtain cache data without the unmarshalling schema; what would happen to grid operability in that case?
>>>>>>
>>>>>>>> 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But such a mode should not be used if you have more than two nodes in the grid, because it has a huge impact on performance.
>>>>>>
>>>>>> Does the WAL mode affect the metadata store?
>>>>>>
>>>>>>>> Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhani...@gmail.com>:
>>>>>>>>
>>>>>>>>> Folks,
>>>>>>>>>
>>>>>>>>> Thanks for showing interest in this issue!
>>>>>>>>>
>>>>>>>>> Alexey,
>>>>>>>>>
>>>>>>>>>> I think removing fsync could help to mitigate performance issues with the current implementation
>>>>>>>>>
>>>>>>>>> Is my understanding correct that if we remove fsync, then discovery won't be blocked, data will be flushed to disk in the background, and loss of information will be possible only on an OS failure? That sounds like an acceptable workaround to me.
>>>>>>>>>
>>>>>>>>> Will moving metadata to the metastore actually resolve this issue? Please correct me if I'm wrong, but we will still need to write the information to the WAL before releasing the discovery thread. If the WAL mode is FSYNC, then the issue will still be there. Or is it planned to abandon the discovery-based protocol altogether?
>>>>>>>>>
>>>>>>>>> Evgeniy, Ivan,
>>>>>>>>>
>>>>>>>>> In my particular case the data wasn't too big. It was a slow virtualised disk with encryption that made operations slow. Given that there are 200 nodes in the cluster, every node writes slowly, and this process is sequential, one piece of metadata is registered extremely slowly.
>>>>>>>>>
>>>>>>>>> Ivan, answering your other questions:
>>>>>>>>>
>>>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it accidental?
>>>>>>>>>
>>>>>>>>> It should be checked whether it's safe to stop writing marshaller mappings to disk without losing any guarantees.
>>>>>>>>> But in any case, I would like to have a property that controls this. If metadata registration is slow, then the initial cluster warmup may take a while. So, if we preserve metadata on disk, we will need to warm it up only once, and further restarts won't be affected.
>>>>>>>>>
>>>>>>>>>> Do we really need a fast fix here?
>>>>>>>>> I would like a fix that can be implemented now, since the activity of moving metadata to the metastore doesn't sound like a quick one. Having a temporary solution would be nice.
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>>> On 14 Aug 2019, at 11:53, Павлухин Иван <vololo...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Denis,
>>>>>>>>>>
>>>>>>>>>> Several clarifying questions:
>>>>>>>>>> 1. Do you have an idea why metadata registration takes so long? Poor disks? Too much data to write? Contention with disk writes by other subsystems?
>>>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it accidental?
>>>>>>>>>>
>>>>>>>>>> Generally, I think that it is possible to move metadata-saving operations out of the discovery thread without losing the required consistency/integrity.
>>>>>>>>>>
>>>>>>>>>> As Alex mentioned, using the metastore looks like a better solution. Do we really need a fast fix here? (Are we talking about a fast fix?)
>>>>>>>>>>
>>>>>>>>>> Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky <arzamas...@mail.ru.invalid>:
>>>>>>>>>>>
>>>>>>>>>>> Alexey, but in this case the customer needs to be informed that a crash (power off) of the whole cluster (for example, of a single node) could lead to partial data unavailability.
>>>>>>>>>>> And maybe further index corruption.
>>>>>>>>>>> 1. Why does your metadata take up a substantial size? Maybe a context leak?
>>>>>>>>>>> 2. Could the metadata be compressed?
>>>>>>>>>>>
>>>>>>>>>>>> Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov <alexey.scherbak...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>> Denis Mekhanikov,
>>>>>>>>>>>>
>>>>>>>>>>>> Currently metadata is fsync'ed on write. This might be the cause of slow-downs in the case of metadata burst writes.
>>>>>>>>>>>> I think removing fsync could help to mitigate performance issues with the current implementation until the proper solution is implemented: moving metadata to the metastore.
>>>>>>>>>>>>
>>>>>>>>>>>> Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhani...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> I would also like to mention that marshaller mappings are written to disk even if persistence is disabled.
>>>>>>>>>>>>> So, this issue affects purely in-memory clusters as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When persistence is enabled, binary metadata is written to disk upon registration. Currently this happens in the discovery thread, which makes processing of related messages very slow.
>>>>>>>>>>>>>> There are cases when a lot of nodes and slow disks can make the registration of every binary type take several minutes. Plus, it blocks processing of other messages.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I propose starting a separate thread that will be responsible for writing binary metadata to disk. So, binary type registration will be considered finished before information about it is written to disks on all nodes.
>>>>>>>>>>>>>> The main concern here is data consistency in cases when a node acknowledges type registration and then fails before writing the metadata to disk.
>>>>>>>>>>>>>> I see two parts of this issue:
>>>>>>>>>>>>>> 1. Nodes will have different metadata after restarting.
>>>>>>>>>>>>>> 2. If we write some data into a persisted cache and shut down the nodes faster than a new binary type is written to disk, then after a restart we won't have a binary type to work with.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The first case is similar to a situation when one node fails and a new type is registered in the cluster after that. This issue is resolved by the discovery data exchange. All nodes receive information about all binary types in the initial discovery messages sent by other nodes. So, once you restart a node, it will receive the information that it failed to finish writing to disk from the other nodes.
>>>>>>>>>>>>>> If all nodes shut down before finishing writing the metadata to disk, then after a restart the type will be considered unregistered, so another registration will be required.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The second case is a bit more complicated. But it can be resolved by making the discovery thread on every node create a future that is completed when writing to disk finishes. So, every node will have such a future reflecting the current state of persisting the metadata to disk.
>>>>>>>>>>>>>> After that, if some operation needs this binary type, it will have to wait on that future until flushing to disk is finished.
>>>>>>>>>>>>>> This way the discovery threads won't be blocked, but the other threads that actually need this type will be.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please let me know what you think about this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Alexei Scherbakov
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Zhenya Stanilovsky
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best regards,
>>>>>>>>>> Ivan Pavlukhin
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Alexei Scherbakov
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Zhenya Stanilovsky
>
> --
>
> Best regards,
> Alexei Scherbakov
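
P.S. Here is the waiting side of the proposal, continuing the sketch above (again, the names are illustrative, not actual Ignite internals): only the operations that need a binary type block on its flush future, while discovery never does.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the waiting side: only the threads that actually
// need a type (e.g. from the striped pool) block until it is on disk.
class MetadataFlushTracker {
    // typeId -> future completed once the type's metadata is written locally
    private final Map<Integer, CompletableFuture<Void>> flushFuts = new ConcurrentHashMap<>();

    // Called from the discovery thread: schedule the write, don't wait for it.
    void onTypeRegistered(int typeId, byte[] meta, MetadataWriter writer) {
        flushFuts.put(typeId, writer.enqueue(typeId, meta));
    }

    // Called from any thread that is about to use the type.
    void awaitWritten(int typeId) throws Exception {
        CompletableFuture<Void> fut = flushFuts.get(typeId);

        if (fut != null)
            fut.get(); // blocks this thread only, not the discovery thread
    }
}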