Alexey, I'm not suggesting that we duplicate anything. My point is that the proper fix will only arrive in the relatively distant future. Why not improve the existing mechanism now instead of waiting for the proper fix? If we don't agree on making this change in master, I can do it in a fork and use it in my setup. So please let me know if you see any other drawbacks in the proposed solution.
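
To make it concrete, here is a minimal sketch of the mechanism I have in mind (all class and method names are illustrative, not actual Ignite internals): the discovery thread only enqueues a metadata update, a dedicated writer thread persists it, and a future lets the threads that actually need the type wait for the flush. A sketch of the waiting side is at the bottom of this mail.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;

// Illustrative sketch: a dedicated thread persists binary metadata,
// so the discovery thread only enqueues the update and moves on.
class MetadataWriter {
    private final BlockingQueue<WriteTask> queue = new LinkedBlockingQueue<>();

    MetadataWriter() {
        Thread t = new Thread(this::writeLoop, "binary-metadata-writer");
        t.setDaemon(true);
        t.start();
    }

    // Called from the discovery thread; returns immediately.
    CompletableFuture<Void> enqueue(int typeId, byte[] meta) {
        WriteTask task = new WriteTask(typeId, meta);
        queue.add(task);
        return task.fut; // completes once the metadata is fsync'ed
    }

    private void writeLoop() {
        while (true) {
            WriteTask task;
            try {
                task = queue.take();
            }
            catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            try {
                writeToDisk(task.typeId, task.meta); // fsync happens here
                task.fut.complete(null);
            }
            catch (Exception e) {
                task.fut.completeExceptionally(e); // fail the waiters, not discovery
            }
        }
    }

    private void writeToDisk(int typeId, byte[] meta) throws Exception {
        // Actual fsync'ed file write goes here; elided in this sketch.
    }

    private static class WriteTask {
        final int typeId;
        final byte[] meta;
        final CompletableFuture<Void> fut = new CompletableFuture<>();

        WriteTask(int typeId, byte[] meta) {
            this.typeId = typeId;
            this.meta = meta;
        }
    }
}

This way a slow disk delays only the operations that depend on the new type, instead of holding the discovery worker.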
Denis

> On 21 Aug 2019, at 15:53, Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
>
> Denis Mekhanikov,
>
> If we are still talking about a "proper" solution, the metastore (I meant, of course, the distributed one) is the way to go.
>
> It has a contract to store cluster-wide metadata in the most efficient way and can have any optimization for concurrent writing inside.
>
> I'm against creating a duplicating mechanism as you suggested. We do not need more copy/paste code.
>
> Another possibility is to carry metadata along with the appropriate request if it's not found locally, but this is a rather big modification.
>
> Tue, 20 Aug 2019 at 17:26, Denis Mekhanikov <dmekhani...@gmail.com>:
>
>> Eduard,
>>
>> Usages will wait for the metadata to be registered and written to disk.
>> No races should occur with such a flow.
>> Or do you have some specific case in mind?
>>
>> I agree that using a distributed metastorage would be nice here.
>> But this way we will, in a sense, move back to the previous scheme with a replicated system cache, where metadata was stored before.
>> Will the scheme with the metastorage be different in any way? Won't we decide to move back to discovery messages again after a while?
>>
>> Denis
>>
>>> On 20 Aug 2019, at 15:13, Eduard Shangareev <eduard.shangar...@gmail.com> wrote:
>>>
>>> Denis,
>>> How would we deal with races between registration and metadata usages with such a fast fix?
>>>
>>> I believe that we need to move it to the distributed metastorage and await registration completeness if we can't find it (wait for the work in progress).
>>> Discovery shouldn't wait for anything here.
>>>
>>> On Tue, Aug 20, 2019 at 11:55 AM Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>
>>>> Sergey,
>>>>
>>>> Currently metadata is written to disk sequentially on every node. Only one node at a time is able to write metadata to its storage.
>>>> Slowness accumulates as you add more nodes. The delay required to write one piece of metadata may not be that big, but multiply it by, say, 200, and it becomes noticeable.
>>>> But if we move the writing out of the discovery threads, the nodes will do it in parallel.
>>>>
>>>> I think it's better to block some threads from a striped pool for a little while than to block discovery for the same period multiplied by the number of nodes.
>>>>
>>>> What do you think?
>>>>
>>>> Denis
>>>>
>>>>> On 15 Aug 2019, at 10:26, Sergey Chugunov <sergey.chugu...@gmail.com> wrote:
>>>>>
>>>>> Denis,
>>>>>
>>>>> Thanks for bringing this issue up; the decision to write binary metadata from the discovery thread was really a tough one to make.
>>>>> I don't think that moving metadata to the metastorage is a silver bullet here, as that approach also has its drawbacks and is not an easy change.
>>>>>
>>>>> In addition to the workarounds suggested by Alexei, we have two options for offloading the write operation from the discovery thread:
>>>>>
>>>>> 1. Your scheme with a separate writer thread and futures completed when the write operation is finished.
>>>>> 2. A PME-like protocol, with obvious complications like failover and asynchronous waiting for replies over the communication layer.
>>>>>
>>>>> Your suggestion looks easier from a code-complexity perspective, but in my view it increases the chances of running into starvation.
>>>>> Now, if some node faces really long delays during a write operation, it is going to be kicked out of the topology by the discovery protocol.
>>>>> In your case it is possible that more and more threads from other pools will get stuck waiting on the operation future, which is also not good.
>>>>>
>>>>> What do you think?
>>>>>
>>>>> I also think that if we want to approach this issue systematically, we need to do a deep analysis of the metastorage option as well and finally choose which road we want to take.
>>>>>
>>>>> Thanks!
>>>>>
>>>>> On Thu, Aug 15, 2019 at 9:28 AM Zhenya Stanilovsky <arzamas...@mail.ru.invalid> wrote:
>>>>>
>>>>>>
>>>>>>>> 1. Yes, only on OS failures. In such a case the data will be received from alive nodes later.
>>>>>>
>>>>>> What would the behavior be in the case of a single node? I suppose someone could obtain cache data without the unmarshalling schema; what would happen to grid operability in that case?
>>>>>>
>>>>>>>> 2. Yes, for walmode=FSYNC writes to the metastore will be slow. But such a mode should not be used if you have more than two nodes in the grid, because it has a huge impact on performance.
>>>>>>
>>>>>> Does the WAL mode affect the metadata store?
>>>>>>
>>>>>>>> Wed, 14 Aug 2019 at 14:29, Denis Mekhanikov <dmekhani...@gmail.com>:
>>>>>>>>
>>>>>>>>> Folks,
>>>>>>>>>
>>>>>>>>> Thanks for showing interest in this issue!
>>>>>>>>>
>>>>>>>>> Alexey,
>>>>>>>>>
>>>>>>>>>> I think removing fsync could help to mitigate performance issues with the current implementation
>>>>>>>>>
>>>>>>>>> Is my understanding correct that if we remove fsync, then discovery won't be blocked, data will be flushed to disk in the background, and loss of information will be possible only on an OS failure? That sounds like an acceptable workaround to me.
>>>>>>>>>
>>>>>>>>> Will moving metadata to the metastore actually resolve this issue? Please correct me if I'm wrong, but we will still need to write the information to the WAL before releasing the discovery thread. If the WAL mode is FSYNC, then the issue will still be there. Or is it planned to abandon the discovery-based protocol altogether?
>>>>>>>>>
>>>>>>>>> Evgeniy, Ivan,
>>>>>>>>>
>>>>>>>>> In my particular case the data wasn't too big. It was a slow virtualised disk with encryption that made operations slow. Given that there are 200 nodes in the cluster, every node writes slowly, and this process is sequential, one piece of metadata is registered extremely slowly.
>>>>>>>>>
>>>>>>>>> Ivan, answering your other questions:
>>>>>>>>>
>>>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it accidental?
>>>>>>>>>
>>>>>>>>> It should be checked whether it's safe to stop writing marshaller mappings to disk without losing any guarantees.
>>>>>>>>> But in any case, I would like to have a property that controls this. If metadata registration is slow, then the initial cluster warmup may take a while. So, if we preserve metadata on disk, we will need to warm it up only once, and further restarts won't be affected.
>>>>>>>>>
>>>>>>>>>> Do we really need a fast fix here?
>>>>>>>>> I would like a fix that can be implemented now, since the activity of moving metadata to the metastore doesn't sound like a quick one. Having a temporary solution would be nice.
>>>>>>>>>
>>>>>>>>> Denis
>>>>>>>>>
>>>>>>>>>> On 14 Aug 2019, at 11:53, Павлухин Иван <vololo...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Denis,
>>>>>>>>>>
>>>>>>>>>> Several clarifying questions:
>>>>>>>>>> 1. Do you have an idea why metadata registration takes so long? Poor disks? Too much data to write? Contention with disk writes by other subsystems?
>>>>>>>>>> 2. Do we need persistent metadata for in-memory caches? Or is it accidental?
>>>>>>>>>>
>>>>>>>>>> Generally, I think that it is possible to move metadata-saving operations out of the discovery thread without losing the required consistency/integrity.
>>>>>>>>>>
>>>>>>>>>> As Alex mentioned, using the metastore looks like a better solution. Do we really need a fast fix here? (Are we talking about a fast fix?)
>>>>>>>>>>
>>>>>>>>>> Wed, 14 Aug 2019 at 11:45, Zhenya Stanilovsky <arzamas...@mail.ru.invalid>:
>>>>>>>>>>>
>>>>>>>>>>> Alexey, but in this case the customer needs to be informed that a crash (power off) of the whole cluster (for example, of a single node) could lead to partial data unavailability.
>>>>>>>>>>> And maybe further index corruption.
>>>>>>>>>>> 1. Why does your metadata take up a substantial size? Maybe a context leak?
>>>>>>>>>>> 2. Could the metadata be compressed?
>>>>>>>>>>>
>>>>>>>>>>>> Wednesday, 14 August 2019, 11:22 +03:00 from Alexei Scherbakov <alexey.scherbak...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>> Denis Mekhanikov,
>>>>>>>>>>>>
>>>>>>>>>>>> Currently metadata is fsync'ed on write. This might be the cause of slow-downs in the case of metadata burst writes.
>>>>>>>>>>>> I think removing fsync could help to mitigate performance issues with the current implementation until the proper solution is implemented: moving metadata to the metastore.
>>>>>>>>>>>>
>>>>>>>>>>>> Tue, 13 Aug 2019 at 17:09, Denis Mekhanikov <dmekhani...@gmail.com>:
>>>>>>>>>>>>
>>>>>>>>>>>>> I would also like to mention that marshaller mappings are written to disk even if persistence is disabled.
>>>>>>>>>>>>> So, this issue affects purely in-memory clusters as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 13 Aug 2019, at 17:06, Denis Mekhanikov <dmekhani...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi!
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> When persistence is enabled, binary metadata is written to disk upon registration. Currently this happens in the discovery thread, which makes processing of related messages very slow.
>>>>>>>>>>>>>> There are cases when a lot of nodes and slow disks can make the registration of every binary type take several minutes. Plus, it blocks processing of other messages.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I propose starting a separate thread that will be responsible for writing binary metadata to disk. So, binary type registration will be considered finished before information about it is written to disks on all nodes.
>>>>>>>>>>>>>> The main concern here is data consistency in cases when a node acknowledges type registration and then fails before writing the metadata to disk.
>>>>>>>>>>>>>> I see two parts of this issue:
>>>>>>>>>>>>>> 1. Nodes will have different metadata after restarting.
>>>>>>>>>>>>>> 2. If we write some data into a persisted cache and shut down the nodes faster than a new binary type is written to disk, then after a restart we won't have a binary type to work with.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The first case is similar to a situation when one node fails and a new type is registered in the cluster after that. This issue is resolved by the discovery data exchange. All nodes receive information about all binary types in the initial discovery messages sent by other nodes. So, once you restart a node, it will receive the information that it failed to finish writing to disk from the other nodes.
>>>>>>>>>>>>>> If all nodes shut down before finishing writing the metadata to disk, then after a restart the type will be considered unregistered, so another registration will be required.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The second case is a bit more complicated. But it can be resolved by making the discovery thread on every node create a future that is completed when writing to disk finishes. So, every node will have such a future reflecting the current state of persisting the metadata to disk.
>>>>>>>>>>>>>> After that, if some operation needs this binary type, it will have to wait on that future until flushing to disk is finished.
>>>>>>>>>>>>>> This way the discovery threads won't be blocked, but the other threads that actually need this type will be.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Please let me know what you think about this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Denis
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>> Alexei Scherbakov
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Zhenya Stanilovsky
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Best regards,
>>>>>>>>>> Ivan Pavlukhin
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Alexei Scherbakov
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Zhenya Stanilovsky
>
> --
>
> Best regards,
> Alexei Scherbakov
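
P.S. Here is the waiting side of the proposal, continuing the sketch above (again, the names are illustrative, not actual Ignite internals): only the operations that need a binary type block on its flush future, while discovery never does.

import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the waiting side: only the threads that actually
// need a type (e.g. from the striped pool) block until it is on disk.
class MetadataFlushTracker {
    // typeId -> future completed once the type's metadata is written locally
    private final Map<Integer, CompletableFuture<Void>> flushFuts = new ConcurrentHashMap<>();

    // Called from the discovery thread: schedule the write, don't wait for it.
    void onTypeRegistered(int typeId, byte[] meta, MetadataWriter writer) {
        flushFuts.put(typeId, writer.enqueue(typeId, meta));
    }

    // Called from any thread that is about to use the type.
    void awaitWritten(int typeId) throws Exception {
        CompletableFuture<Void> fut = flushFuts.get(typeId);

        if (fut != null)
            fut.get(); // blocks this thread only, not the discovery thread
    }
}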