> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?

The use case I would have for transactions - at some level of the stack - is supporting dynamic configurations.

If a config change touches e.g. three lines, some of the changes may logically belong together, e.g. changing both “host” and “port” (if they are separate entries). One shouldn’t be able to read a state, even temporarily, that has the new host but the old port.

I can do this at the application level - it does not need to be part of the DL protocol.

Asko Kauppi
Zalando Tech Helsinki

> On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
>
> Sorry for late response. I think Leigh and you already had some very
> valuable discussions in the doc. I will try to add some of my questions to
> the discussion.
>
> Beside that, I had a discussion with Leigh today about this. first of all,
> I think it is very good to add transaction support in distributedlog. It is
> one of the primitives that would help building distributed service. But we
> have a concern about making this system become complicated and introduce
> operational overhead when it runs in the large scale system on production.
> There are two major suggestions that I have for this feature -
>
> Build the 'minimum' logic in core - I think the minimum logic that need to
> be added to the core is - the special control records (begin, commit and
> abort) and make the reader be able to detect those special control records
> and know what do they mean and how to interrupt with them. Since they are
> special control records, there is not overhead to other readers that
> doesn't require this feature.
>
> Build the transaction coordinator as a separated proxy service - I think
> the major concern that we have is putting more complexities into the 'write
> proxy' service. We architected distributedlog in a more microservice-like
> way - we have the core as the stream store, the proxy for serving write and
> read traffic. It would be good that the transaction feature can be done in
> a similar way.
> So the architecture would be like this -
>
> *[ write service ] [ read service ] [ transaction coordinator ]*
> *[ stream store ]*
>
> if people doesn't need the transaction feature, they can turn if off
> completely without any operational overhead.
>
> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?
>
>
> Thanks,
> Sijie
>
> On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi.liu....@gmail.com> wrote:
>
>> Ping?
>>
>> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi.liu....@gmail.com> wrote:
>>
>>> Sijie,
>>>
>>> No. I thought it might be easier for people to comment on a google doc to
>>> gather the initial feedback. I will put the content back to wiki page once
>>> addressing the comments. Does that sound good to you?
>>>
>>> And thank you in advance.
>>>
>>> - Xi
>>>
>>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
>>>
>>>> Hi Xi,
>>>>
>>>> sorry for late response. I will review it soon.
>>>>
>>>> regarding this, a separate question "are we going to use google doc instead
>>>> of email thread for any discussion"? I am a bit worried that the discussion
>>>> will become lost after moving to google doc. No idea on how other apache
>>>> projects are doing.
>>>>
>>>> - Sijie
>>>>
>>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi.liu....@gmail.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> I finalized the first version of the design. This time I used a google doc
>>>>> so that it is easier for commenting and add a link the wiki page. I will
>>>>> update this to the wiki page once we come to the finalized design.
>>>>>
>>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeKbSIGgSzXuTI5BA/edit
>>>>>
>>>>> Let me know if you have any questions. Appreciate your reviews!
>>>>>
>>>>> - Xi
>>>>>
>>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart <lstew...@twitter.com.invalid> wrote:
>>>>>
>>>>>> Interesting proposal. A couple quick notes while you continue to flesh this
>>>>>> out.
>>>>>>
>>>>>> a. just to be sure - does this eliminate the need to save seqno with
>>>>>> checkpoint?
>>>>>>
>>>>>> b. i.e. another way to describe this kind of improvement is "support
>>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage being it
>>>>>> avoids the baggage of transactions. disadvantages include inability to do
>>>>>> cross stream transactions, and flexibility (interleaving, etc) (are there
>>>>>> others?).
>>>>>>
>>>>>> c. proxy use case is for supporting multiple writers - have you thought
>>>>>> about how this would work with multiple writers?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <sij...@twitter.com.invalid> wrote:
>>>>>>
>>>>>>> Sound good to me. look forward to the detailed proposal.
>>>>>>>
>>>>>>> (I don't mind the format if it makes things easier to you)
>>>>>>>
>>>>>>> Sijie
>>>>>>>
>>>>>>> On Friday, October 14, 2016, Xi Liu <xi.liu....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Thank you, Sijie
>>>>>>>>
>>>>>>>> We have some internal discussions to sort out some details. We are ready to
>>>>>>>> collaborate with the community for adding the transaction support in DL.
>>>>>>>> We'd like to share more.
>>>>>>>>
>>>>>>>> I created a proposal wiki here -
>>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+DistributedLog+Transaction+Support
>>>>>>>>
>>>>>>>> (I followed KIP format and named it as DP (DistributedLog Proposal - DP is
>>>>>>>> also short for Dynamic Programming). I don't know if you guys like this
>>>>>>>> name or not.
>>>>>>>> Feel free to change it :D)
>>>>>>>>
>>>>>>>> I basically put my initial email as the content there so far. Once we
>>>>>>>> finished our final discussion, I will update with more details. At the same
>>>>>>>> time, any comments are welcome.
>>>>>>>>
>>>>>>>> - Xi
>>>>>>>>
>>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <si...@apache.org> wrote:
>>>>>>>>
>>>>>>>>> Xi,
>>>>>>>>>
>>>>>>>>> I just granted you the edit permission.
>>>>>>>>>
>>>>>>>>> - Sijie
>>>>>>>>>
>>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu....@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> I still can not edit the wiki. Can any of the pmc members grant me the
>>>>>>>>>> permissions?
>>>>>>>>>>
>>>>>>>>>> - Xi
>>>>>>>>>>
>>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <xi.liu....@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Sijie,
>>>>>>>>>>>
>>>>>>>>>>> I attempted to create a wiki page under that space. I found that I am not
>>>>>>>>>>> authorized with edit permission.
>>>>>>>>>>>
>>>>>>>>>>> Can any of the committers grant me the wiki edit permission? My account is
>>>>>>>>>>> "xi.liu.ant".
>>>>>>>>>>>
>>>>>>>>>>> - Xi
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <si...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This sounds interesting ... I will take a closer look and give my comments
>>>>>>>>>>>> later.
>>>>>>>>>>>>
>>>>>>>>>>>> At the same time, do you mind creating a wiki page to put your idea there?
>>>>>>>>>>>>
>>>>>>>>>>>> You can add your wiki page under
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+Proposals
>>>>>>>>>>>>
>>>>>>>>>>>> You might need to ask in the dev list to grant the wiki edit permissions to
>>>>>>>>>>>> you once you have a wiki account.
>>>>>>>>>>>>
>>>>>>>>>>>> - Sijie
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <xi.liu....@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I asked the transaction support in distributedlog user group two months
>>>>>>>>>>>>> ago. I want to raise this up again, as we are looking for using
>>>>>>>>>>>>> distributedlog for building a transactional data service. It is a major
>>>>>>>>>>>>> feature that is missing in distributedlog. We have some ideas to add this
>>>>>>>>>>>>> to distributedlog and want to know if they make sense or not. If they are
>>>>>>>>>>>>> good, we'd like to contribute and develop with the community.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Here are the thoughts:
>>>>>>>>>>>>>
>>>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> From our understanding, DL can provide "at-least-once" delivery semantic
>>>>>>>>>>>>> (if not, please correct me) but not "exactly-once" delivery semantic. That
>>>>>>>>>>>>> means that a message can be delivered one or more times if the reader
>>>>>>>>>>>>> doesn't handle duplicates.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The duplicates come from two places, one is at writer side (this assumes
>>>>>>>>>>>>> using write proxy not the core library), while the other one is at reader
>>>>>>>>>>>>> side.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - writer side: if the client attempts to write a record to the write
>>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then retries, the retrying
>>>>>>>>>>>>> will potentially result in duplicates.
>>>>>>>>>>>>> - reader side:if the reader reads a message from a stream and then crashes,
>>>>>>>>>>>>> when the reader restarts it would restart from last known position (DLSN).
>>>>>>>>>>>>> If the reader fails after processing a record and before recording the
>>>>>>>>>>>>> position, the processed record will be delivered again.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The reader problem can be properly addressed by making use of the sequence
>>>>>>>>>>>>> numbers of records and doing proper checkpointing. For example, in
>>>>>>>>>>>>> database, it can checkpoint the indexed data with the sequence number of
>>>>>>>>>>>>> records; in flink, it can checkpoint the state with the sequence numbers.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The writer problem can be addressed by implementing an idempotent writer.
>>>>>>>>>>>>> However, an alternative and more powerful approach is to support
>>>>>>>>>>>>> transactions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *What does transaction mean?*
>>>>>>>>>>>>>
>>>>>>>>>>>>> A transaction means a collection of records can be written transactionally
>>>>>>>>>>>>> within a stream or across multiple streams. They will be consumed by the
>>>>>>>>>>>>> reader together when a transaction is committed, or will never be consumed
>>>>>>>>>>>>> by the reader when the transaction is aborted.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The transaction will expose following guarantees:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - The reader should not be exposed to records written from uncommitted
>>>>>>>>>>>>> transactions (mandatory)
>>>>>>>>>>>>> - The reader should consume the records in the transaction commit order
>>>>>>>>>>>>> rather than the record written order (mandatory)
>>>>>>>>>>>>> - No duplicated records within a transaction (mandatory)
>>>>>>>>>>>>> - Allow interleaving transactional writes and non-transactional writes
>>>>>>>>>>>>> (optional)
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
>>>>>>>>>>>>>
>>>>>>>>>>>>> There will be two types of transaction, one is Stream level transaction
>>>>>>>>>>>>> (local transaction), while the other one is Namespace level transaction
>>>>>>>>>>>>> (global transaction).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The stream level transaction is a transactional operation on writing
>>>>>>>>>>>>> records to one stream; the namespace level transaction is a transactional
>>>>>>>>>>>>> operation on writing records to multiple streams.
>>>>>>>>>>>>>
>>>>>>>>>>>>> *Implementation Thoughts*
>>>>>>>>>>>>>
>>>>>>>>>>>>> - A transaction is consist of begin control record, a series of data
>>>>>>>>>>>>> records and commit/abort control record.
>>>>>>>>>>>>> - The begin/commit/abort control record is written to a `commit` log
>>>>>>>>>>>>> stream, while the data records will be written to normal data log streams.
>>>>>>>>>>>>> - The `commit` log stream will be the same log stream for stream-level
>>>>>>>>>>>>> transaction, while it will be a *system* stream (or multiple system
>>>>>>>>>>>>> streams) for namespace-level transactions.
>>>>>>>>>>>>> - The transaction code looks like as below:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <code>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Transaction txn = client.transaction();
>>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
>>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
>>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
>>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
>>>>>>>>>>>>>
>>>>>>>>>>>>> </code>
>>>>>>>>>>>>>
>>>>>>>>>>>>> if the txn is committed, all the write futures will be satisfied with their
>>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write futures will be failed
>>>>>>>>>>>>> together. there is no partial failure state.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - The actually data flow will be:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. writer get a transaction id from the owner of the `commit' log stream
>>>>>>>>>>>>> 1. write the begin control record (synchronously) with the transaction id
>>>>>>>>>>>>> 2. for each write within the same txn, it will be assigned a local sequence
>>>>>>>>>>>>> number starting from 0. the combination of transaction id and local
>>>>>>>>>>>>> sequence number will be used later on by the readers to de-duplicate
>>>>>>>>>>>>> records.
>>>>>>>>>>>>> 3. the commit/abort control record will be written based on the results
>>>>>>>>>>>>> from 2.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Application can supply a timeout for the transaction when #begin() a
>>>>>>>>>>>>> transaction. The owner of the `commit` log stream can abort transactions
>>>>>>>>>>>>> that never be committed/aborted within their timeout.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Failures:
>>>>>>>>>>>>>
>>>>>>>>>>>>> * all the log records can be simply retried as they will be de-duplicated
>>>>>>>>>>>>> probably at the reader side.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Reader:
>>>>>>>>>>>>>
>>>>>>>>>>>>> * Reader can be configured to read uncommitted records or committed records
>>>>>>>>>>>>> only (by default read uncommitted records)
>>>>>>>>>>>>> * If reader is configured to read committed records only, the read ahead
>>>>>>>>>>>>> cache will be changed to maintain one additional pending committed records.
>>>>>>>>>>>>> the pending committed records map is bounded and records will be dropped
>>>>>>>>>>>>> when read ahead is moving.
>>>>>>>>>>>>> * when the reader hits a commit record, it will rewind to the begin record
>>>>>>>>>>>>> and start reading from there. leveraging the proper read ahead cache and
>>>>>>>>>>>>> pending commit records cache, it would be good for both short transactions
>>>>>>>>>>>>> and long transactions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - DLSN, SequenceId:
>>>>>>>>>>>>>
>>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local sequence number` within
>>>>>>>>>>>>> a transaction session. So the new DLSN of records in a transaction will be
>>>>>>>>>>>>> the DLSN of commit control record plus its local sequence number.
>>>>>>>>>>>>> * The sequence id will be still the position of the commit record plus its
>>>>>>>>>>>>> local sequence number. The position will be advanced with total number of
>>>>>>>>>>>>> written records on writing the commit control record.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Transaction Group & Namespace Transaction
>>>>>>>>>>>>>
>>>>>>>>>>>>> using one single log stream for namespace transaction can cause the
>>>>>>>>>>>>> bottleneck problem since all the begin/commit/end control records will have
>>>>>>>>>>>>> to go through one log stream.
>>>>>>>>>>>>>
>>>>>>>>>>>>> the idea of 'transaction group' is to allow partitioning the writers into
>>>>>>>>>>>>> different transaction groups.
>>>>>>>>>>>>>
>>>>>>>>>>>>> clients can specify the `group-name` when starting the transaction. if
>>>>>>>>>>>>> there is no `group-name` specified, it will use the default `commit` log in
>>>>>>>>>>>>> the namespace for creating transactions.
>>>>>>>>>>>>>
>>>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate any comments and if
>>>>>>>>>>>>> anyone is also interested in this idea, we'd like to collaborate with the
>>>>>>>>>>>>> community.
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Xi
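
For illustration only, here is a rough Java sketch of how the "host"/"port" case from the top of this message could be handled on the reading side, whether in the application or in a reader that understands control records. It assumes a log carrying begin/data/commit/abort records tagged with a transaction id and a local sequence number, as in the quoted proposal. Every name in it (`CommittedConfigView`, `LogRecord`, `Kind`, `onRecord`, `snapshot`) is made up for the example and is not DistributedLog API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch: none of these types exist in DistributedLog.
class CommittedConfigView {
    enum Kind { BEGIN, DATA, COMMIT, ABORT }

    // One log record; localSeq, key and value are only used by DATA records.
    static final class LogRecord {
        final Kind kind;
        final long txnId;
        final int localSeq;
        final String key;
        final String value;

        LogRecord(Kind kind, long txnId, int localSeq, String key, String value) {
            this.kind = kind;
            this.txnId = txnId;
            this.localSeq = localSeq;
            this.key = key;
            this.value = value;
        }

        static LogRecord control(Kind kind, long txnId) {
            return new LogRecord(kind, txnId, -1, null, null);
        }

        static LogRecord data(long txnId, int localSeq, String key, String value) {
            return new LogRecord(Kind.DATA, txnId, localSeq, key, value);
        }
    }

    // Writes of open transactions: txn id -> (local sequence number -> record).
    private final Map<Long, TreeMap<Integer, LogRecord>> pending = new HashMap<>();
    // Transactions already committed or aborted, so late retries are ignored.
    // Unbounded here; a real reader would have to trim it.
    private final Set<Long> finished = new HashSet<>();
    // The configuration the application is allowed to see: committed changes only.
    private final Map<String, String> visible = new HashMap<>();

    void onRecord(LogRecord r) {
        if (finished.contains(r.txnId)) {
            return; // retried record of a transaction that is already closed
        }
        switch (r.kind) {
            case BEGIN:
                pending.putIfAbsent(r.txnId, new TreeMap<>());
                break;
            case DATA: {
                TreeMap<Integer, LogRecord> buffered = pending.get(r.txnId);
                if (buffered != null) {
                    // (txnId, localSeq) identifies a write; a retried duplicate is dropped.
                    buffered.putIfAbsent(r.localSeq, r);
                }
                break;
            }
            case COMMIT: {
                TreeMap<Integer, LogRecord> buffered = pending.remove(r.txnId);
                if (buffered != null) {
                    // Apply all buffered writes in one step, in local sequence order.
                    for (LogRecord d : buffered.values()) {
                        visible.put(d.key, d.value);
                    }
                }
                finished.add(r.txnId);
                break;
            }
            case ABORT:
                pending.remove(r.txnId);
                finished.add(r.txnId);
                break;
        }
    }

    // Copy of the committed configuration; never contains half of a transaction.
    Map<String, String> snapshot() {
        return new HashMap<>(visible);
    }

    public static void main(String[] args) {
        CommittedConfigView view = new CommittedConfigView();
        view.onRecord(LogRecord.control(Kind.BEGIN, 7L));
        view.onRecord(LogRecord.data(7L, 0, "host", "db2.example"));
        System.out.println("before commit: " + view.snapshot());
        view.onRecord(LogRecord.data(7L, 1, "port", "5433"));
        view.onRecord(LogRecord.control(Kind.COMMIT, 7L));
        System.out.println("after commit: " + view.snapshot());
    }
}
```

The point of the sketch is only that "host" and "port" become visible in the same step. It is single-threaded, keeps everything in memory, and ignores ordering across transactions and the bounded read-ahead cache discussed in the thread.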