On Wed, Jan 4, 2017 at 1:14 AM, Asko Kauppi <asko.kau...@zalando.fi> wrote:
> > Beside that, I have one general question - What is the major goal for > this > > feature? Are you targeting on building a general XA transaction > coordinator > > or just for supporting things like `copy-modify-write' style workflow? > > The use case I would have for transactions - at some level of the stack - > is supporting dynamic configurations. > > If a config changes in e.g. three lines, some of the changes may logically > belong together. E.g. changing both “host” and “port” (if separate > entries). One shouldn’t be able to read a state, even temporarily, that has > new host but old port. > > I can do this in the application level - it does not need to be part of > the DL protocol. > Yeah, I can see 'transaction' as a large atomic write in DL is very useful. Currently DL limits the record size to 1MB. If people wants to write a record larger than 1MB, it will potentially produce a `partial` write if application breaks their record into multiple records. Your use case falls into this category. I think the minimal support is large atomic write. That is good enough for most of the log use cases. Having a separated TC (transaction coordinator) is cool. but it can be an opt-in solution. > > > Asko Kauppi > Zalando Tech Helsinki > > > On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote: > > > > Sorry for late response. I think Leigh and you already had some very > > valuable discussions in the doc. I will try to add some of my questions > to > > the discussion. > > > > Beside that, I had a discussion with Leigh today about this. first of > all, > > I think it is very good to add transaction support in distributedlog. It > is > > one of the primitives that would help building distributed service. But > we > > have a concern about making this system become complicated and introduce > > operational overhead when it runs in the large scale system on > production. > > There are two major suggestions that I have for this feature - > > > > Build the 'minimum' logic in core - I think the minimum logic that need > to > > be added to the core is - the special control records (begin, commit and > > abort) and make the reader be able to detect those special control > records > > and know what do they mean and how to interrupt with them. Since they are > > special control records, there is not overhead to other readers that > > doesn't require this feature. > > > > Build the transaction coordinator as a separated proxy service - I think > > the major concern that we have is putting more complexities into the > 'write > > proxy' service. We architected distributedlog in a more microservice-like > > way - we have the core as the stream store, the proxy for serving write > and > > read traffic. It would be good that the transaction feature can be done > in > > a similar way. So the architecture would be like this - > > > > *[ write service ] [ read service ] [ transaction coordinator ]* > > *[ stream store > > ]* > > > > if people doesn't need the transaction feature, they can turn if off > > completely without any operational overhead. > > > > Beside that, I have one general question - What is the major goal for > this > > feature? Are you targeting on building a general XA transaction > coordinator > > or just for supporting things like `copy-modify-write' style workflow? > > > > > > Thanks, > > Sijie > > > > > > > > > > > > On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi.liu....@gmail.com> wrote: > > > >> Ping? > >> > >> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi.liu....@gmail.com> wrote: > >> > >>> Sijie, > >>> > >>> No. I thought it might be easier for people to comment on a google doc > to > >>> gather the initial feedback. I will put the content back to wiki page > >> once > >>> addressing the comments. Does that sound good to you? > >>> > >>> And thank you in advance. > >>> > >>> - Xi > >>> > >>> > >>> > >>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote: > >>> > >>>> Hi Xi, > >>>> > >>>> sorry for late response. I will review it soon. > >>>> > >>>> regarding this, a separate question "are we going to use google doc > >>>> instead > >>>> of email thread for any discussion"? I am a bit worried that the > >>>> discussion > >>>> will become lost after moving to google doc. No idea on how other > apache > >>>> projects are doing. > >>>> > >>>> - Sijie > >>>> > >>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi.liu....@gmail.com> > wrote: > >>>> > >>>>> Hi all, > >>>>> > >>>>> I finalized the first version of the design. This time I used a > google > >>>> doc > >>>>> so that it is easier for commenting and add a link the wiki page. I > >> will > >>>>> update this to the wiki page once we come to the finalized design. > >>>>> > >>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK > >>>>> bSIGgSzXuTI5BA/edit > >>>>> > >>>>> Let me know if you have any questions. Appreciate your reviews! > >>>>> > >>>>> - Xi > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> > >>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart > >>>>> <lstew...@twitter.com.invalid > >>>>>> wrote: > >>>>> > >>>>>> Interesting proposal. A couple quick notes while you continue to > >> flesh > >>>>> this > >>>>>> out. > >>>>>> > >>>>>> a. just to be sure - does this eliminate the need to save seqno with > >>>>>> checkpoint? > >>>>>> > >>>>>> b. i.e. another way to describe this kind of improvement is "support > >>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage being > >> it > >>>>>> avoids the baggage of transactions. disadvantages include inability > >>>> to do > >>>>>> cross stream transactions, and flexibility (interleaving, etc) (are > >>>> there > >>>>>> others?). > >>>>>> > >>>>>> c. proxy use case is for supporting multiple writers - have you > >>>> thought > >>>>>> about how this would work with multiple writers? > >>>>>> > >>>>>> Thanks! > >>>>>> > >>>>>> > >>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo > >> <sij...@twitter.com.invalid > >>>>> > >>>>>> wrote: > >>>>>> > >>>>>>> Sound good to me. look forward to the detailed proposal. > >>>>>>> > >>>>>>> (I don't mind the format if it makes things easier to you) > >>>>>>> > >>>>>>> Sijie > >>>>>>> > >>>>>>> On Friday, October 14, 2016, Xi Liu <xi.liu....@gmail.com> wrote: > >>>>>>> > >>>>>>>> Thank you, Sijie > >>>>>>>> > >>>>>>>> We have some internal discussions to sort out some details. We > >> are > >>>>>> ready > >>>>>>> to > >>>>>>>> collaborate with the community for adding the transaction > >> support > >>>> in > >>>>>> DL. > >>>>>>>> We'd like to share more. > >>>>>>>> > >>>>>>>> I created a proposal wiki here - > >>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+ > >>>>>>>> DistributedLog+Transaction+Support > >>>>>>>> > >>>>>>>> (I followed KIP format and named it as DP (DistributedLog > >>>> Proposal - > >>>>> DP > >>>>>>> is > >>>>>>>> also short for Dynamic Programming). I don't know if you guys > >> like > >>>>> this > >>>>>>>> name or not. Feel free to change it :D) > >>>>>>>> > >>>>>>>> I basically put my initial email as the content there so far. > >>>> Once we > >>>>>>>> finished our final discussion, I will update with more details. > >> At > >>>>> the > >>>>>>> same > >>>>>>>> time, any comments are welcome. > >>>>>>>> > >>>>>>>> - Xi > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <si...@apache.org > >>>>>>> <javascript:;>> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Xi, > >>>>>>>>> > >>>>>>>>> I just granted you the edit permission. > >>>>>>>>> > >>>>>>>>> - Sijie > >>>>>>>>> > >>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu....@gmail.com > >>>>>>>> <javascript:;>> wrote: > >>>>>>>>> > >>>>>>>>>> I still can not edit the wiki. Can any of the pmc members > >>>> grant > >>>>> me > >>>>>>> the > >>>>>>>>>> permissions? > >>>>>>>>>> > >>>>>>>>>> - Xi > >>>>>>>>>> > >>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu < > >>>> xi.liu....@gmail.com > >>>>>>>> <javascript:;>> wrote: > >>>>>>>>>> > >>>>>>>>>>> Sijie, > >>>>>>>>>>> > >>>>>>>>>>> I attempted to create a wiki page under that space. I > >> found > >>>>> that > >>>>>> I > >>>>>>> am > >>>>>>>>> not > >>>>>>>>>>> authorized with edit permission. > >>>>>>>>>>> > >>>>>>>>>>> Can any of the committers grant me the wiki edit > >>>> permission? My > >>>>>>>> account > >>>>>>>>>> is > >>>>>>>>>>> "xi.liu.ant". > >>>>>>>>>>> > >>>>>>>>>>> - Xi > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo < > >>>> si...@apache.org > >>>>>>>> <javascript:;>> wrote: > >>>>>>>>>>> > >>>>>>>>>>>> This sounds interesting ... I will take a closer look and > >>>> give > >>>>>> my > >>>>>>>>>> comments > >>>>>>>>>>>> later. > >>>>>>>>>>>> > >>>>>>>>>>>> At the same time, do you mind creating a wiki page to put > >>>> your > >>>>>>> idea > >>>>>>>>>> there? > >>>>>>>>>>>> You can add your wiki page under > >>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+ > >>>>>> Proposals > >>>>>>>>>>>> > >>>>>>>>>>>> You might need to ask in the dev list to grant the wiki > >>>> edit > >>>>>>>>> permissions > >>>>>>>>>>>> to > >>>>>>>>>>>> you once you have a wiki account. > >>>>>>>>>>>> > >>>>>>>>>>>> - Sijie > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu < > >>>> xi.liu....@gmail.com > >>>>>>>> <javascript:;>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Hello, > >>>>>>>>>>>>> > >>>>>>>>>>>>> I asked the transaction support in distributedlog user > >>>> group > >>>>>> two > >>>>>>>>>> months > >>>>>>>>>>>>> ago. I want to raise this up again, as we are looking > >> for > >>>>>> using > >>>>>>>>>>>>> distributedlog for building a transactional data > >>>> service. It > >>>>>> is > >>>>>>> a > >>>>>>>>>> major > >>>>>>>>>>>>> feature that is missing in distributedlog. We have some > >>>>> ideas > >>>>>> to > >>>>>>>> add > >>>>>>>>>>>> this > >>>>>>>>>>>>> to distributedlog and want to know if they make sense > >> or > >>>>> not. > >>>>>> If > >>>>>>>>> they > >>>>>>>>>>>> are > >>>>>>>>>>>>> good, we'd like to contribute and develop with the > >>>>> community. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Here are the thoughts: > >>>>>>>>>>>>> > >>>>>>>>>>>>> ------------------------------------------------- > >>>>>>>>>>>>> > >>>>>>>>>>>>> From our understanding, DL can provide "at-least-once" > >>>>>> delivery > >>>>>>>>>> semantic > >>>>>>>>>>>>> (if not, please correct me) but not "exactly-once" > >>>> delivery > >>>>>>>>> semantic. > >>>>>>>>>>>> That > >>>>>>>>>>>>> means that a message can be delivered one or more times > >>>> if > >>>>> the > >>>>>>>>> reader > >>>>>>>>>>>>> doesn't handle duplicates. > >>>>>>>>>>>>> > >>>>>>>>>>>>> The duplicates come from two places, one is at writer > >>>> side > >>>>>> (this > >>>>>>>>>> assumes > >>>>>>>>>>>>> using write proxy not the core library), while the > >> other > >>>> one > >>>>>> is > >>>>>>> at > >>>>>>>>>>>> reader > >>>>>>>>>>>>> side. > >>>>>>>>>>>>> > >>>>>>>>>>>>> - writer side: if the client attempts to write a record > >>>> to > >>>>> the > >>>>>>>> write > >>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then > >>>>> retries, > >>>>>>> the > >>>>>>>>>>>> retrying > >>>>>>>>>>>>> will potentially result in duplicates. > >>>>>>>>>>>>> - reader side:if the reader reads a message from a > >> stream > >>>>> and > >>>>>>> then > >>>>>>>>>>>> crashes, > >>>>>>>>>>>>> when the reader restarts it would restart from last > >> known > >>>>>>> position > >>>>>>>>>>>> (DLSN). > >>>>>>>>>>>>> If the reader fails after processing a record and > >> before > >>>>>>> recording > >>>>>>>>> the > >>>>>>>>>>>>> position, the processed record will be delivered again. > >>>>>>>>>>>>> > >>>>>>>>>>>>> The reader problem can be properly addressed by making > >>>> use > >>>>> of > >>>>>>> the > >>>>>>>>>>>> sequence > >>>>>>>>>>>>> numbers of records and doing proper checkpointing. For > >>>>>> example, > >>>>>>> in > >>>>>>>>>>>>> database, it can checkpoint the indexed data with the > >>>>> sequence > >>>>>>>>> number > >>>>>>>>>> of > >>>>>>>>>>>>> records; in flink, it can checkpoint the state with the > >>>>>> sequence > >>>>>>>>>>>> numbers. > >>>>>>>>>>>>> > >>>>>>>>>>>>> The writer problem can be addressed by implementing an > >>>>>>> idempotent > >>>>>>>>>>>> writer. > >>>>>>>>>>>>> However, an alternative and more powerful approach is > >> to > >>>>>> support > >>>>>>>>>>>>> transactions. > >>>>>>>>>>>>> > >>>>>>>>>>>>> *What does transaction mean?* > >>>>>>>>>>>>> > >>>>>>>>>>>>> A transaction means a collection of records can be > >>>> written > >>>>>>>>>>>> transactionally > >>>>>>>>>>>>> within a stream or across multiple streams. They will > >> be > >>>>>>> consumed > >>>>>>>> by > >>>>>>>>>> the > >>>>>>>>>>>>> reader together when a transaction is committed, or > >> will > >>>>> never > >>>>>>> be > >>>>>>>>>>>> consumed > >>>>>>>>>>>>> by the reader when the transaction is aborted. > >>>>>>>>>>>>> > >>>>>>>>>>>>> The transaction will expose following guarantees: > >>>>>>>>>>>>> > >>>>>>>>>>>>> - The reader should not be exposed to records written > >>>> from > >>>>>>>>> uncommitted > >>>>>>>>>>>>> transactions (mandatory) > >>>>>>>>>>>>> - The reader should consume the records in the > >>>> transaction > >>>>>>> commit > >>>>>>>>>> order > >>>>>>>>>>>>> rather than the record written order (mandatory) > >>>>>>>>>>>>> - No duplicated records within a transaction > >> (mandatory) > >>>>>>>>>>>>> - Allow interleaving transactional writes and > >>>>>> non-transactional > >>>>>>>>> writes > >>>>>>>>>>>>> (optional) > >>>>>>>>>>>>> > >>>>>>>>>>>>> *Stream Transaction & Namespace Transaction* > >>>>>>>>>>>>> > >>>>>>>>>>>>> There will be two types of transaction, one is Stream > >>>> level > >>>>>>>>>> transaction > >>>>>>>>>>>>> (local transaction), while the other one is Namespace > >>>> level > >>>>>>>>>> transaction > >>>>>>>>>>>>> (global transaction). > >>>>>>>>>>>>> > >>>>>>>>>>>>> The stream level transaction is a transactional > >>>> operation on > >>>>>>>> writing > >>>>>>>>>>>>> records to one stream; the namespace level transaction > >>>> is a > >>>>>>>>>>>> transactional > >>>>>>>>>>>>> operation on writing records to multiple streams. > >>>>>>>>>>>>> > >>>>>>>>>>>>> *Implementation Thoughts* > >>>>>>>>>>>>> > >>>>>>>>>>>>> - A transaction is consist of begin control record, a > >>>> series > >>>>>> of > >>>>>>>> data > >>>>>>>>>>>>> records and commit/abort control record. > >>>>>>>>>>>>> - The begin/commit/abort control record is written to a > >>>>>> `commit` > >>>>>>>> log > >>>>>>>>>>>>> stream, while the data records will be written to > >> normal > >>>>> data > >>>>>>> log > >>>>>>>>>>>> streams. > >>>>>>>>>>>>> - The `commit` log stream will be the same log stream > >> for > >>>>>>>>> stream-level > >>>>>>>>>>>>> transaction, while it will be a *system* stream (or > >>>>> multiple > >>>>>>>> system > >>>>>>>>>>>>> streams) for namespace-level transactions. > >>>>>>>>>>>>> - The transaction code looks like as below: > >>>>>>>>>>>>> > >>>>>>>>>>>>> <code> > >>>>>>>>>>>>> > >>>>>>>>>>>>> Transaction txn = client.transaction(); > >>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record); > >>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record); > >>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record); > >>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit(); > >>>>>>>>>>>>> > >>>>>>>>>>>>> </code> > >>>>>>>>>>>>> > >>>>>>>>>>>>> if the txn is committed, all the write futures will be > >>>>>> satisfied > >>>>>>>>> with > >>>>>>>>>>>> their > >>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write > >>>> futures > >>>>>> will > >>>>>>>> be > >>>>>>>>>>>> failed > >>>>>>>>>>>>> together. there is no partial failure state. > >>>>>>>>>>>>> > >>>>>>>>>>>>> - The actually data flow will be: > >>>>>>>>>>>>> > >>>>>>>>>>>>> 1. writer get a transaction id from the owner of the > >>>>> `commit' > >>>>>>> log > >>>>>>>>>> stream > >>>>>>>>>>>>> 1. write the begin control record (synchronously) with > >>>> the > >>>>>>>>> transaction > >>>>>>>>>>>> id > >>>>>>>>>>>>> 2. for each write within the same txn, it will be > >>>> assigned a > >>>>>>> local > >>>>>>>>>>>> sequence > >>>>>>>>>>>>> number starting from 0. the combination of transaction > >> id > >>>>> and > >>>>>>>> local > >>>>>>>>>>>>> sequence number will be used later on by the readers to > >>>>>>>> de-duplicate > >>>>>>>>>>>>> records. > >>>>>>>>>>>>> 3. the commit/abort control record will be written > >> based > >>>> on > >>>>>> the > >>>>>>>>>> results > >>>>>>>>>>>>> from 2. > >>>>>>>>>>>>> > >>>>>>>>>>>>> - Application can supply a timeout for the transaction > >>>> when > >>>>>>>>> #begin() a > >>>>>>>>>>>>> transaction. The owner of the `commit` log stream can > >>>> abort > >>>>>>>>>> transactions > >>>>>>>>>>>>> that never be committed/aborted within their timeout. > >>>>>>>>>>>>> > >>>>>>>>>>>>> - Failures: > >>>>>>>>>>>>> > >>>>>>>>>>>>> * all the log records can be simply retried as they > >> will > >>>> be > >>>>>>>>>>>> de-duplicated > >>>>>>>>>>>>> probably at the reader side. > >>>>>>>>>>>>> > >>>>>>>>>>>>> - Reader: > >>>>>>>>>>>>> > >>>>>>>>>>>>> * Reader can be configured to read uncommitted records > >> or > >>>>>>>> committed > >>>>>>>>>>>> records > >>>>>>>>>>>>> only (by default read uncommitted records) > >>>>>>>>>>>>> * If reader is configured to read committed records > >> only, > >>>>> the > >>>>>>> read > >>>>>>>>>> ahead > >>>>>>>>>>>>> cache will be changed to maintain one additional > >> pending > >>>>>>> committed > >>>>>>>>>>>> records. > >>>>>>>>>>>>> the pending committed records map is bounded and > >> records > >>>>> will > >>>>>> be > >>>>>>>>>> dropped > >>>>>>>>>>>>> when read ahead is moving. > >>>>>>>>>>>>> * when the reader hits a commit record, it will rewind > >> to > >>>>> the > >>>>>>>> begin > >>>>>>>>>>>> record > >>>>>>>>>>>>> and start reading from there. leveraging the proper > >> read > >>>>> ahead > >>>>>>>> cache > >>>>>>>>>> and > >>>>>>>>>>>>> pending commit records cache, it would be good for both > >>>>> short > >>>>>>>>>>>> transactions > >>>>>>>>>>>>> and long transactions. > >>>>>>>>>>>>> > >>>>>>>>>>>>> - DLSN, SequenceId: > >>>>>>>>>>>>> > >>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local > >>>> sequence > >>>>>>>> number` > >>>>>>>>>>>> within > >>>>>>>>>>>>> a transaction session. So the new DLSN of records in a > >>>>>>> transaction > >>>>>>>>>> will > >>>>>>>>>>>> be > >>>>>>>>>>>>> the DLSN of commit control record plus its local > >> sequence > >>>>>>> number. > >>>>>>>>>>>>> * The sequence id will be still the position of the > >>>> commit > >>>>>>> record > >>>>>>>>> plus > >>>>>>>>>>>> its > >>>>>>>>>>>>> local sequence number. The position will be advanced > >> with > >>>>>> total > >>>>>>>>> number > >>>>>>>>>>>> of > >>>>>>>>>>>>> written records on writing the commit control record. > >>>>>>>>>>>>> > >>>>>>>>>>>>> - Transaction Group & Namespace Transaction > >>>>>>>>>>>>> > >>>>>>>>>>>>> using one single log stream for namespace transaction > >> can > >>>>>> cause > >>>>>>>> the > >>>>>>>>>>>>> bottleneck problem since all the begin/commit/end > >> control > >>>>>>> records > >>>>>>>>> will > >>>>>>>>>>>> have > >>>>>>>>>>>>> to go through one log stream. > >>>>>>>>>>>>> > >>>>>>>>>>>>> the idea of 'transaction group' is to allow > >> partitioning > >>>> the > >>>>>>>> writers > >>>>>>>>>>>> into > >>>>>>>>>>>>> different transaction groups. > >>>>>>>>>>>>> > >>>>>>>>>>>>> clients can specify the `group-name` when starting the > >>>>>>>> transaction. > >>>>>>>>> if > >>>>>>>>>>>>> there is no `group-name` specified, it will use the > >>>> default > >>>>>>>> `commit` > >>>>>>>>>>>> log in > >>>>>>>>>>>>> the namespace for creating transactions. > >>>>>>>>>>>>> > >>>>>>>>>>>>> ------------------------------------------------- > >>>>>>>>>>>>> > >>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate > >>>> any > >>>>>>>> comments > >>>>>>>>>> and > >>>>>>>>>>>> if > >>>>>>>>>>>>> anyone is also interested in this idea, we'd like to > >>>>>> collaborate > >>>>>>>>> with > >>>>>>>>>>>> the > >>>>>>>>>>>>> community. > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> - Xi > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>>> > >>>> > >>> > >>> > >> > >