Xi, I added more comments. We are looking forward to your reply and seeing this happen.
- Sijie On Tue, Jan 3, 2017 at 11:18 PM, Sijie Guo <si...@apache.org> wrote: > Sorry for late response. I think Leigh and you already had some very > valuable discussions in the doc. I will try to add some of my questions to > the discussion. > > Beside that, I had a discussion with Leigh today about this. first of all, > I think it is very good to add transaction support in distributedlog. It is > one of the primitives that would help building distributed service. But we > have a concern about making this system become complicated and introduce > operational overhead when it runs in the large scale system on production. > There are two major suggestions that I have for this feature - > > Build the 'minimum' logic in core - I think the minimum logic that need to > be added to the core is - the special control records (begin, commit and > abort) and make the reader be able to detect those special control records > and know what do they mean and how to interrupt with them. Since they are > special control records, there is not overhead to other readers that > doesn't require this feature. > > Build the transaction coordinator as a separated proxy service - I think > the major concern that we have is putting more complexities into the 'write > proxy' service. We architected distributedlog in a more microservice-like > way - we have the core as the stream store, the proxy for serving write and > read traffic. It would be good that the transaction feature can be done in > a similar way. So the architecture would be like this - > > *[ write service ] [ read service ] [ transaction coordinator ]* > *[ stream store > ]* > > if people doesn't need the transaction feature, they can turn if off > completely without any operational overhead. > > Beside that, I have one general question - What is the major goal for this > feature? Are you targeting on building a general XA transaction coordinator > or just for supporting things like `copy-modify-write' style workflow? > > > Thanks, > Sijie > > > > > > On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi.liu....@gmail.com> wrote: > >> Ping? >> >> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi.liu....@gmail.com> wrote: >> >> > Sijie, >> > >> > No. I thought it might be easier for people to comment on a google doc >> to >> > gather the initial feedback. I will put the content back to wiki page >> once >> > addressing the comments. Does that sound good to you? >> > >> > And thank you in advance. >> > >> > - Xi >> > >> > >> > >> > On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote: >> > >> >> Hi Xi, >> >> >> >> sorry for late response. I will review it soon. >> >> >> >> regarding this, a separate question "are we going to use google doc >> >> instead >> >> of email thread for any discussion"? I am a bit worried that the >> >> discussion >> >> will become lost after moving to google doc. No idea on how other >> apache >> >> projects are doing. >> >> >> >> - Sijie >> >> >> >> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi.liu....@gmail.com> wrote: >> >> >> >> > Hi all, >> >> > >> >> > I finalized the first version of the design. This time I used a >> google >> >> doc >> >> > so that it is easier for commenting and add a link the wiki page. I >> will >> >> > update this to the wiki page once we come to the finalized design. >> >> > >> >> > https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK >> >> > bSIGgSzXuTI5BA/edit >> >> > >> >> > Let me know if you have any questions. Appreciate your reviews! >> >> > >> >> > - Xi >> >> > >> >> > >> >> > >> >> > >> >> > >> >> > On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart >> >> > <lstew...@twitter.com.invalid >> >> > > wrote: >> >> > >> >> > > Interesting proposal. A couple quick notes while you continue to >> flesh >> >> > this >> >> > > out. >> >> > > >> >> > > a. just to be sure - does this eliminate the need to save seqno >> with >> >> > > checkpoint? >> >> > > >> >> > > b. i.e. another way to describe this kind of improvement is >> "support >> >> > > records (atomic writes) larger than 1MB", iiuc. the advantage >> being it >> >> > > avoids the baggage of transactions. disadvantages include inability >> >> to do >> >> > > cross stream transactions, and flexibility (interleaving, etc) (are >> >> there >> >> > > others?). >> >> > > >> >> > > c. proxy use case is for supporting multiple writers - have you >> >> thought >> >> > > about how this would work with multiple writers? >> >> > > >> >> > > Thanks! >> >> > > >> >> > > >> >> > > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo >> <sij...@twitter.com.invalid >> >> > >> >> > > wrote: >> >> > > >> >> > > > Sound good to me. look forward to the detailed proposal. >> >> > > > >> >> > > > (I don't mind the format if it makes things easier to you) >> >> > > > >> >> > > > Sijie >> >> > > > >> >> > > > On Friday, October 14, 2016, Xi Liu <xi.liu....@gmail.com> >> wrote: >> >> > > > >> >> > > > > Thank you, Sijie >> >> > > > > >> >> > > > > We have some internal discussions to sort out some details. We >> are >> >> > > ready >> >> > > > to >> >> > > > > collaborate with the community for adding the transaction >> support >> >> in >> >> > > DL. >> >> > > > > We'd like to share more. >> >> > > > > >> >> > > > > I created a proposal wiki here - >> >> > > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+ >> >> > > > > DistributedLog+Transaction+Support >> >> > > > > >> >> > > > > (I followed KIP format and named it as DP (DistributedLog >> >> Proposal - >> >> > DP >> >> > > > is >> >> > > > > also short for Dynamic Programming). I don't know if you guys >> like >> >> > this >> >> > > > > name or not. Feel free to change it :D) >> >> > > > > >> >> > > > > I basically put my initial email as the content there so far. >> >> Once we >> >> > > > > finished our final discussion, I will update with more >> details. At >> >> > the >> >> > > > same >> >> > > > > time, any comments are welcome. >> >> > > > > >> >> > > > > - Xi >> >> > > > > >> >> > > > > >> >> > > > > >> >> > > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <si...@apache.org >> >> > > > <javascript:;>> >> >> > > > > wrote: >> >> > > > > >> >> > > > > > Xi, >> >> > > > > > >> >> > > > > > I just granted you the edit permission. >> >> > > > > > >> >> > > > > > - Sijie >> >> > > > > > >> >> > > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu < >> xi.liu....@gmail.com >> >> > > > > <javascript:;>> wrote: >> >> > > > > > >> >> > > > > > > I still can not edit the wiki. Can any of the pmc members >> >> grant >> >> > me >> >> > > > the >> >> > > > > > > permissions? >> >> > > > > > > >> >> > > > > > > - Xi >> >> > > > > > > >> >> > > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu < >> >> xi.liu....@gmail.com >> >> > > > > <javascript:;>> wrote: >> >> > > > > > > >> >> > > > > > > > Sijie, >> >> > > > > > > > >> >> > > > > > > > I attempted to create a wiki page under that space. I >> found >> >> > that >> >> > > I >> >> > > > am >> >> > > > > > not >> >> > > > > > > > authorized with edit permission. >> >> > > > > > > > >> >> > > > > > > > Can any of the committers grant me the wiki edit >> >> permission? My >> >> > > > > account >> >> > > > > > > is >> >> > > > > > > > "xi.liu.ant". >> >> > > > > > > > >> >> > > > > > > > - Xi >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo < >> >> si...@apache.org >> >> > > > > <javascript:;>> wrote: >> >> > > > > > > > >> >> > > > > > > >> This sounds interesting ... I will take a closer look >> and >> >> give >> >> > > my >> >> > > > > > > comments >> >> > > > > > > >> later. >> >> > > > > > > >> >> >> > > > > > > >> At the same time, do you mind creating a wiki page to >> put >> >> your >> >> > > > idea >> >> > > > > > > there? >> >> > > > > > > >> You can add your wiki page under >> >> > > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+ >> >> > > Proposals >> >> > > > > > > >> >> >> > > > > > > >> You might need to ask in the dev list to grant the wiki >> >> edit >> >> > > > > > permissions >> >> > > > > > > >> to >> >> > > > > > > >> you once you have a wiki account. >> >> > > > > > > >> >> >> > > > > > > >> - Sijie >> >> > > > > > > >> >> >> > > > > > > >> >> >> > > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu < >> >> xi.liu....@gmail.com >> >> > > > > <javascript:;>> wrote: >> >> > > > > > > >> >> >> > > > > > > >> > Hello, >> >> > > > > > > >> > >> >> > > > > > > >> > I asked the transaction support in distributedlog user >> >> group >> >> > > two >> >> > > > > > > months >> >> > > > > > > >> > ago. I want to raise this up again, as we are looking >> for >> >> > > using >> >> > > > > > > >> > distributedlog for building a transactional data >> >> service. It >> >> > > is >> >> > > > a >> >> > > > > > > major >> >> > > > > > > >> > feature that is missing in distributedlog. We have >> some >> >> > ideas >> >> > > to >> >> > > > > add >> >> > > > > > > >> this >> >> > > > > > > >> > to distributedlog and want to know if they make sense >> or >> >> > not. >> >> > > If >> >> > > > > > they >> >> > > > > > > >> are >> >> > > > > > > >> > good, we'd like to contribute and develop with the >> >> > community. >> >> > > > > > > >> > >> >> > > > > > > >> > Here are the thoughts: >> >> > > > > > > >> > >> >> > > > > > > >> > ------------------------------------------------- >> >> > > > > > > >> > >> >> > > > > > > >> > From our understanding, DL can provide "at-least-once" >> >> > > delivery >> >> > > > > > > semantic >> >> > > > > > > >> > (if not, please correct me) but not "exactly-once" >> >> delivery >> >> > > > > > semantic. >> >> > > > > > > >> That >> >> > > > > > > >> > means that a message can be delivered one or more >> times >> >> if >> >> > the >> >> > > > > > reader >> >> > > > > > > >> > doesn't handle duplicates. >> >> > > > > > > >> > >> >> > > > > > > >> > The duplicates come from two places, one is at writer >> >> side >> >> > > (this >> >> > > > > > > assumes >> >> > > > > > > >> > using write proxy not the core library), while the >> other >> >> one >> >> > > is >> >> > > > at >> >> > > > > > > >> reader >> >> > > > > > > >> > side. >> >> > > > > > > >> > >> >> > > > > > > >> > - writer side: if the client attempts to write a >> record >> >> to >> >> > the >> >> > > > > write >> >> > > > > > > >> > proxies and gets a network error (e.g timeouts) then >> >> > retries, >> >> > > > the >> >> > > > > > > >> retrying >> >> > > > > > > >> > will potentially result in duplicates. >> >> > > > > > > >> > - reader side:if the reader reads a message from a >> stream >> >> > and >> >> > > > then >> >> > > > > > > >> crashes, >> >> > > > > > > >> > when the reader restarts it would restart from last >> known >> >> > > > position >> >> > > > > > > >> (DLSN). >> >> > > > > > > >> > If the reader fails after processing a record and >> before >> >> > > > recording >> >> > > > > > the >> >> > > > > > > >> > position, the processed record will be delivered >> again. >> >> > > > > > > >> > >> >> > > > > > > >> > The reader problem can be properly addressed by making >> >> use >> >> > of >> >> > > > the >> >> > > > > > > >> sequence >> >> > > > > > > >> > numbers of records and doing proper checkpointing. For >> >> > > example, >> >> > > > in >> >> > > > > > > >> > database, it can checkpoint the indexed data with the >> >> > sequence >> >> > > > > > number >> >> > > > > > > of >> >> > > > > > > >> > records; in flink, it can checkpoint the state with >> the >> >> > > sequence >> >> > > > > > > >> numbers. >> >> > > > > > > >> > >> >> > > > > > > >> > The writer problem can be addressed by implementing an >> >> > > > idempotent >> >> > > > > > > >> writer. >> >> > > > > > > >> > However, an alternative and more powerful approach is >> to >> >> > > support >> >> > > > > > > >> > transactions. >> >> > > > > > > >> > >> >> > > > > > > >> > *What does transaction mean?* >> >> > > > > > > >> > >> >> > > > > > > >> > A transaction means a collection of records can be >> >> written >> >> > > > > > > >> transactionally >> >> > > > > > > >> > within a stream or across multiple streams. They will >> be >> >> > > > consumed >> >> > > > > by >> >> > > > > > > the >> >> > > > > > > >> > reader together when a transaction is committed, or >> will >> >> > never >> >> > > > be >> >> > > > > > > >> consumed >> >> > > > > > > >> > by the reader when the transaction is aborted. >> >> > > > > > > >> > >> >> > > > > > > >> > The transaction will expose following guarantees: >> >> > > > > > > >> > >> >> > > > > > > >> > - The reader should not be exposed to records written >> >> from >> >> > > > > > uncommitted >> >> > > > > > > >> > transactions (mandatory) >> >> > > > > > > >> > - The reader should consume the records in the >> >> transaction >> >> > > > commit >> >> > > > > > > order >> >> > > > > > > >> > rather than the record written order (mandatory) >> >> > > > > > > >> > - No duplicated records within a transaction >> (mandatory) >> >> > > > > > > >> > - Allow interleaving transactional writes and >> >> > > non-transactional >> >> > > > > > writes >> >> > > > > > > >> > (optional) >> >> > > > > > > >> > >> >> > > > > > > >> > *Stream Transaction & Namespace Transaction* >> >> > > > > > > >> > >> >> > > > > > > >> > There will be two types of transaction, one is Stream >> >> level >> >> > > > > > > transaction >> >> > > > > > > >> > (local transaction), while the other one is Namespace >> >> level >> >> > > > > > > transaction >> >> > > > > > > >> > (global transaction). >> >> > > > > > > >> > >> >> > > > > > > >> > The stream level transaction is a transactional >> >> operation on >> >> > > > > writing >> >> > > > > > > >> > records to one stream; the namespace level transaction >> >> is a >> >> > > > > > > >> transactional >> >> > > > > > > >> > operation on writing records to multiple streams. >> >> > > > > > > >> > >> >> > > > > > > >> > *Implementation Thoughts* >> >> > > > > > > >> > >> >> > > > > > > >> > - A transaction is consist of begin control record, a >> >> series >> >> > > of >> >> > > > > data >> >> > > > > > > >> > records and commit/abort control record. >> >> > > > > > > >> > - The begin/commit/abort control record is written to >> a >> >> > > `commit` >> >> > > > > log >> >> > > > > > > >> > stream, while the data records will be written to >> normal >> >> > data >> >> > > > log >> >> > > > > > > >> streams. >> >> > > > > > > >> > - The `commit` log stream will be the same log stream >> for >> >> > > > > > stream-level >> >> > > > > > > >> > transaction, while it will be a *system* stream (or >> >> > multiple >> >> > > > > system >> >> > > > > > > >> > streams) for namespace-level transactions. >> >> > > > > > > >> > - The transaction code looks like as below: >> >> > > > > > > >> > >> >> > > > > > > >> > <code> >> >> > > > > > > >> > >> >> > > > > > > >> > Transaction txn = client.transaction(); >> >> > > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record); >> >> > > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record); >> >> > > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record); >> >> > > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit(); >> >> > > > > > > >> > >> >> > > > > > > >> > </code> >> >> > > > > > > >> > >> >> > > > > > > >> > if the txn is committed, all the write futures will be >> >> > > satisfied >> >> > > > > > with >> >> > > > > > > >> their >> >> > > > > > > >> > written DLSNs. if the txn is aborted, all the write >> >> futures >> >> > > will >> >> > > > > be >> >> > > > > > > >> failed >> >> > > > > > > >> > together. there is no partial failure state. >> >> > > > > > > >> > >> >> > > > > > > >> > - The actually data flow will be: >> >> > > > > > > >> > >> >> > > > > > > >> > 1. writer get a transaction id from the owner of the >> >> > `commit' >> >> > > > log >> >> > > > > > > stream >> >> > > > > > > >> > 1. write the begin control record (synchronously) with >> >> the >> >> > > > > > transaction >> >> > > > > > > >> id >> >> > > > > > > >> > 2. for each write within the same txn, it will be >> >> assigned a >> >> > > > local >> >> > > > > > > >> sequence >> >> > > > > > > >> > number starting from 0. the combination of >> transaction id >> >> > and >> >> > > > > local >> >> > > > > > > >> > sequence number will be used later on by the readers >> to >> >> > > > > de-duplicate >> >> > > > > > > >> > records. >> >> > > > > > > >> > 3. the commit/abort control record will be written >> based >> >> on >> >> > > the >> >> > > > > > > results >> >> > > > > > > >> > from 2. >> >> > > > > > > >> > >> >> > > > > > > >> > - Application can supply a timeout for the transaction >> >> when >> >> > > > > > #begin() a >> >> > > > > > > >> > transaction. The owner of the `commit` log stream can >> >> abort >> >> > > > > > > transactions >> >> > > > > > > >> > that never be committed/aborted within their timeout. >> >> > > > > > > >> > >> >> > > > > > > >> > - Failures: >> >> > > > > > > >> > >> >> > > > > > > >> > * all the log records can be simply retried as they >> will >> >> be >> >> > > > > > > >> de-duplicated >> >> > > > > > > >> > probably at the reader side. >> >> > > > > > > >> > >> >> > > > > > > >> > - Reader: >> >> > > > > > > >> > >> >> > > > > > > >> > * Reader can be configured to read uncommitted >> records or >> >> > > > > committed >> >> > > > > > > >> records >> >> > > > > > > >> > only (by default read uncommitted records) >> >> > > > > > > >> > * If reader is configured to read committed records >> only, >> >> > the >> >> > > > read >> >> > > > > > > ahead >> >> > > > > > > >> > cache will be changed to maintain one additional >> pending >> >> > > > committed >> >> > > > > > > >> records. >> >> > > > > > > >> > the pending committed records map is bounded and >> records >> >> > will >> >> > > be >> >> > > > > > > dropped >> >> > > > > > > >> > when read ahead is moving. >> >> > > > > > > >> > * when the reader hits a commit record, it will >> rewind to >> >> > the >> >> > > > > begin >> >> > > > > > > >> record >> >> > > > > > > >> > and start reading from there. leveraging the proper >> read >> >> > ahead >> >> > > > > cache >> >> > > > > > > and >> >> > > > > > > >> > pending commit records cache, it would be good for >> both >> >> > short >> >> > > > > > > >> transactions >> >> > > > > > > >> > and long transactions. >> >> > > > > > > >> > >> >> > > > > > > >> > - DLSN, SequenceId: >> >> > > > > > > >> > >> >> > > > > > > >> > * We will add a fourth field to DLSN. It is `local >> >> sequence >> >> > > > > number` >> >> > > > > > > >> within >> >> > > > > > > >> > a transaction session. So the new DLSN of records in a >> >> > > > transaction >> >> > > > > > > will >> >> > > > > > > >> be >> >> > > > > > > >> > the DLSN of commit control record plus its local >> sequence >> >> > > > number. >> >> > > > > > > >> > * The sequence id will be still the position of the >> >> commit >> >> > > > record >> >> > > > > > plus >> >> > > > > > > >> its >> >> > > > > > > >> > local sequence number. The position will be advanced >> with >> >> > > total >> >> > > > > > number >> >> > > > > > > >> of >> >> > > > > > > >> > written records on writing the commit control record. >> >> > > > > > > >> > >> >> > > > > > > >> > - Transaction Group & Namespace Transaction >> >> > > > > > > >> > >> >> > > > > > > >> > using one single log stream for namespace transaction >> can >> >> > > cause >> >> > > > > the >> >> > > > > > > >> > bottleneck problem since all the begin/commit/end >> control >> >> > > > records >> >> > > > > > will >> >> > > > > > > >> have >> >> > > > > > > >> > to go through one log stream. >> >> > > > > > > >> > >> >> > > > > > > >> > the idea of 'transaction group' is to allow >> partitioning >> >> the >> >> > > > > writers >> >> > > > > > > >> into >> >> > > > > > > >> > different transaction groups. >> >> > > > > > > >> > >> >> > > > > > > >> > clients can specify the `group-name` when starting the >> >> > > > > transaction. >> >> > > > > > if >> >> > > > > > > >> > there is no `group-name` specified, it will use the >> >> default >> >> > > > > `commit` >> >> > > > > > > >> log in >> >> > > > > > > >> > the namespace for creating transactions. >> >> > > > > > > >> > >> >> > > > > > > >> > ------------------------------------------------- >> >> > > > > > > >> > >> >> > > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate >> >> any >> >> > > > > comments >> >> > > > > > > and >> >> > > > > > > >> if >> >> > > > > > > >> > anyone is also interested in this idea, we'd like to >> >> > > collaborate >> >> > > > > > with >> >> > > > > > > >> the >> >> > > > > > > >> > community. >> >> > > > > > > >> > >> >> > > > > > > >> > >> >> > > > > > > >> > - Xi >> >> > > > > > > >> > >> >> > > > > > > >> >> >> > > > > > > > >> >> > > > > > > > >> >> > > > > > > >> >> > > > > > >> >> > > > > >> >> > > > >> >> > > >> >> > >> >> >> > >> > >> > >