Ping? On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <[email protected]> wrote:
> Sijie, > > No. I thought it might be easier for people to comment on a google doc to > gather the initial feedback. I will put the content back to wiki page once > addressing the comments. Does that sound good to you? > > And thank you in advance. > > - Xi > > > > On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <[email protected]> wrote: > >> Hi Xi, >> >> sorry for late response. I will review it soon. >> >> regarding this, a separate question "are we going to use google doc >> instead >> of email thread for any discussion"? I am a bit worried that the >> discussion >> will become lost after moving to google doc. No idea on how other apache >> projects are doing. >> >> - Sijie >> >> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <[email protected]> wrote: >> >> > Hi all, >> > >> > I finalized the first version of the design. This time I used a google >> doc >> > so that it is easier for commenting and add a link the wiki page. I will >> > update this to the wiki page once we come to the finalized design. >> > >> > https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK >> > bSIGgSzXuTI5BA/edit >> > >> > Let me know if you have any questions. Appreciate your reviews! >> > >> > - Xi >> > >> > >> > >> > >> > >> > On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart >> > <[email protected] >> > > wrote: >> > >> > > Interesting proposal. A couple quick notes while you continue to flesh >> > this >> > > out. >> > > >> > > a. just to be sure - does this eliminate the need to save seqno with >> > > checkpoint? >> > > >> > > b. i.e. another way to describe this kind of improvement is "support >> > > records (atomic writes) larger than 1MB", iiuc. the advantage being it >> > > avoids the baggage of transactions. disadvantages include inability >> to do >> > > cross stream transactions, and flexibility (interleaving, etc) (are >> there >> > > others?). >> > > >> > > c. proxy use case is for supporting multiple writers - have you >> thought >> > > about how this would work with multiple writers? >> > > >> > > Thanks! >> > > >> > > >> > > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <[email protected] >> > >> > > wrote: >> > > >> > > > Sound good to me. look forward to the detailed proposal. >> > > > >> > > > (I don't mind the format if it makes things easier to you) >> > > > >> > > > Sijie >> > > > >> > > > On Friday, October 14, 2016, Xi Liu <[email protected]> wrote: >> > > > >> > > > > Thank you, Sijie >> > > > > >> > > > > We have some internal discussions to sort out some details. We are >> > > ready >> > > > to >> > > > > collaborate with the community for adding the transaction support >> in >> > > DL. >> > > > > We'd like to share more. >> > > > > >> > > > > I created a proposal wiki here - >> > > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+ >> > > > > DistributedLog+Transaction+Support >> > > > > >> > > > > (I followed KIP format and named it as DP (DistributedLog >> Proposal - >> > DP >> > > > is >> > > > > also short for Dynamic Programming). I don't know if you guys like >> > this >> > > > > name or not. Feel free to change it :D) >> > > > > >> > > > > I basically put my initial email as the content there so far. >> Once we >> > > > > finished our final discussion, I will update with more details. At >> > the >> > > > same >> > > > > time, any comments are welcome. >> > > > > >> > > > > - Xi >> > > > > >> > > > > >> > > > > >> > > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <[email protected] >> > > > <javascript:;>> >> > > > > wrote: >> > > > > >> > > > > > Xi, >> > > > > > >> > > > > > I just granted you the edit permission. >> > > > > > >> > > > > > - Sijie >> > > > > > >> > > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <[email protected] >> > > > > <javascript:;>> wrote: >> > > > > > >> > > > > > > I still can not edit the wiki. Can any of the pmc members >> grant >> > me >> > > > the >> > > > > > > permissions? >> > > > > > > >> > > > > > > - Xi >> > > > > > > >> > > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu < >> [email protected] >> > > > > <javascript:;>> wrote: >> > > > > > > >> > > > > > > > Sijie, >> > > > > > > > >> > > > > > > > I attempted to create a wiki page under that space. I found >> > that >> > > I >> > > > am >> > > > > > not >> > > > > > > > authorized with edit permission. >> > > > > > > > >> > > > > > > > Can any of the committers grant me the wiki edit >> permission? My >> > > > > account >> > > > > > > is >> > > > > > > > "xi.liu.ant". >> > > > > > > > >> > > > > > > > - Xi >> > > > > > > > >> > > > > > > > >> > > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo < >> [email protected] >> > > > > <javascript:;>> wrote: >> > > > > > > > >> > > > > > > >> This sounds interesting ... I will take a closer look and >> give >> > > my >> > > > > > > comments >> > > > > > > >> later. >> > > > > > > >> >> > > > > > > >> At the same time, do you mind creating a wiki page to put >> your >> > > > idea >> > > > > > > there? >> > > > > > > >> You can add your wiki page under >> > > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+ >> > > Proposals >> > > > > > > >> >> > > > > > > >> You might need to ask in the dev list to grant the wiki >> edit >> > > > > > permissions >> > > > > > > >> to >> > > > > > > >> you once you have a wiki account. >> > > > > > > >> >> > > > > > > >> - Sijie >> > > > > > > >> >> > > > > > > >> >> > > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu < >> [email protected] >> > > > > <javascript:;>> wrote: >> > > > > > > >> >> > > > > > > >> > Hello, >> > > > > > > >> > >> > > > > > > >> > I asked the transaction support in distributedlog user >> group >> > > two >> > > > > > > months >> > > > > > > >> > ago. I want to raise this up again, as we are looking for >> > > using >> > > > > > > >> > distributedlog for building a transactional data >> service. It >> > > is >> > > > a >> > > > > > > major >> > > > > > > >> > feature that is missing in distributedlog. We have some >> > ideas >> > > to >> > > > > add >> > > > > > > >> this >> > > > > > > >> > to distributedlog and want to know if they make sense or >> > not. >> > > If >> > > > > > they >> > > > > > > >> are >> > > > > > > >> > good, we'd like to contribute and develop with the >> > community. >> > > > > > > >> > >> > > > > > > >> > Here are the thoughts: >> > > > > > > >> > >> > > > > > > >> > ------------------------------------------------- >> > > > > > > >> > >> > > > > > > >> > From our understanding, DL can provide "at-least-once" >> > > delivery >> > > > > > > semantic >> > > > > > > >> > (if not, please correct me) but not "exactly-once" >> delivery >> > > > > > semantic. >> > > > > > > >> That >> > > > > > > >> > means that a message can be delivered one or more times >> if >> > the >> > > > > > reader >> > > > > > > >> > doesn't handle duplicates. >> > > > > > > >> > >> > > > > > > >> > The duplicates come from two places, one is at writer >> side >> > > (this >> > > > > > > assumes >> > > > > > > >> > using write proxy not the core library), while the other >> one >> > > is >> > > > at >> > > > > > > >> reader >> > > > > > > >> > side. >> > > > > > > >> > >> > > > > > > >> > - writer side: if the client attempts to write a record >> to >> > the >> > > > > write >> > > > > > > >> > proxies and gets a network error (e.g timeouts) then >> > retries, >> > > > the >> > > > > > > >> retrying >> > > > > > > >> > will potentially result in duplicates. >> > > > > > > >> > - reader side:if the reader reads a message from a stream >> > and >> > > > then >> > > > > > > >> crashes, >> > > > > > > >> > when the reader restarts it would restart from last known >> > > > position >> > > > > > > >> (DLSN). >> > > > > > > >> > If the reader fails after processing a record and before >> > > > recording >> > > > > > the >> > > > > > > >> > position, the processed record will be delivered again. >> > > > > > > >> > >> > > > > > > >> > The reader problem can be properly addressed by making >> use >> > of >> > > > the >> > > > > > > >> sequence >> > > > > > > >> > numbers of records and doing proper checkpointing. For >> > > example, >> > > > in >> > > > > > > >> > database, it can checkpoint the indexed data with the >> > sequence >> > > > > > number >> > > > > > > of >> > > > > > > >> > records; in flink, it can checkpoint the state with the >> > > sequence >> > > > > > > >> numbers. >> > > > > > > >> > >> > > > > > > >> > The writer problem can be addressed by implementing an >> > > > idempotent >> > > > > > > >> writer. >> > > > > > > >> > However, an alternative and more powerful approach is to >> > > support >> > > > > > > >> > transactions. >> > > > > > > >> > >> > > > > > > >> > *What does transaction mean?* >> > > > > > > >> > >> > > > > > > >> > A transaction means a collection of records can be >> written >> > > > > > > >> transactionally >> > > > > > > >> > within a stream or across multiple streams. They will be >> > > > consumed >> > > > > by >> > > > > > > the >> > > > > > > >> > reader together when a transaction is committed, or will >> > never >> > > > be >> > > > > > > >> consumed >> > > > > > > >> > by the reader when the transaction is aborted. >> > > > > > > >> > >> > > > > > > >> > The transaction will expose following guarantees: >> > > > > > > >> > >> > > > > > > >> > - The reader should not be exposed to records written >> from >> > > > > > uncommitted >> > > > > > > >> > transactions (mandatory) >> > > > > > > >> > - The reader should consume the records in the >> transaction >> > > > commit >> > > > > > > order >> > > > > > > >> > rather than the record written order (mandatory) >> > > > > > > >> > - No duplicated records within a transaction (mandatory) >> > > > > > > >> > - Allow interleaving transactional writes and >> > > non-transactional >> > > > > > writes >> > > > > > > >> > (optional) >> > > > > > > >> > >> > > > > > > >> > *Stream Transaction & Namespace Transaction* >> > > > > > > >> > >> > > > > > > >> > There will be two types of transaction, one is Stream >> level >> > > > > > > transaction >> > > > > > > >> > (local transaction), while the other one is Namespace >> level >> > > > > > > transaction >> > > > > > > >> > (global transaction). >> > > > > > > >> > >> > > > > > > >> > The stream level transaction is a transactional >> operation on >> > > > > writing >> > > > > > > >> > records to one stream; the namespace level transaction >> is a >> > > > > > > >> transactional >> > > > > > > >> > operation on writing records to multiple streams. >> > > > > > > >> > >> > > > > > > >> > *Implementation Thoughts* >> > > > > > > >> > >> > > > > > > >> > - A transaction is consist of begin control record, a >> series >> > > of >> > > > > data >> > > > > > > >> > records and commit/abort control record. >> > > > > > > >> > - The begin/commit/abort control record is written to a >> > > `commit` >> > > > > log >> > > > > > > >> > stream, while the data records will be written to normal >> > data >> > > > log >> > > > > > > >> streams. >> > > > > > > >> > - The `commit` log stream will be the same log stream for >> > > > > > stream-level >> > > > > > > >> > transaction, while it will be a *system* stream (or >> > multiple >> > > > > system >> > > > > > > >> > streams) for namespace-level transactions. >> > > > > > > >> > - The transaction code looks like as below: >> > > > > > > >> > >> > > > > > > >> > <code> >> > > > > > > >> > >> > > > > > > >> > Transaction txn = client.transaction(); >> > > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record); >> > > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record); >> > > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record); >> > > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit(); >> > > > > > > >> > >> > > > > > > >> > </code> >> > > > > > > >> > >> > > > > > > >> > if the txn is committed, all the write futures will be >> > > satisfied >> > > > > > with >> > > > > > > >> their >> > > > > > > >> > written DLSNs. if the txn is aborted, all the write >> futures >> > > will >> > > > > be >> > > > > > > >> failed >> > > > > > > >> > together. there is no partial failure state. >> > > > > > > >> > >> > > > > > > >> > - The actually data flow will be: >> > > > > > > >> > >> > > > > > > >> > 1. writer get a transaction id from the owner of the >> > `commit' >> > > > log >> > > > > > > stream >> > > > > > > >> > 1. write the begin control record (synchronously) with >> the >> > > > > > transaction >> > > > > > > >> id >> > > > > > > >> > 2. for each write within the same txn, it will be >> assigned a >> > > > local >> > > > > > > >> sequence >> > > > > > > >> > number starting from 0. the combination of transaction id >> > and >> > > > > local >> > > > > > > >> > sequence number will be used later on by the readers to >> > > > > de-duplicate >> > > > > > > >> > records. >> > > > > > > >> > 3. the commit/abort control record will be written based >> on >> > > the >> > > > > > > results >> > > > > > > >> > from 2. >> > > > > > > >> > >> > > > > > > >> > - Application can supply a timeout for the transaction >> when >> > > > > > #begin() a >> > > > > > > >> > transaction. The owner of the `commit` log stream can >> abort >> > > > > > > transactions >> > > > > > > >> > that never be committed/aborted within their timeout. >> > > > > > > >> > >> > > > > > > >> > - Failures: >> > > > > > > >> > >> > > > > > > >> > * all the log records can be simply retried as they will >> be >> > > > > > > >> de-duplicated >> > > > > > > >> > probably at the reader side. >> > > > > > > >> > >> > > > > > > >> > - Reader: >> > > > > > > >> > >> > > > > > > >> > * Reader can be configured to read uncommitted records or >> > > > > committed >> > > > > > > >> records >> > > > > > > >> > only (by default read uncommitted records) >> > > > > > > >> > * If reader is configured to read committed records only, >> > the >> > > > read >> > > > > > > ahead >> > > > > > > >> > cache will be changed to maintain one additional pending >> > > > committed >> > > > > > > >> records. >> > > > > > > >> > the pending committed records map is bounded and records >> > will >> > > be >> > > > > > > dropped >> > > > > > > >> > when read ahead is moving. >> > > > > > > >> > * when the reader hits a commit record, it will rewind to >> > the >> > > > > begin >> > > > > > > >> record >> > > > > > > >> > and start reading from there. leveraging the proper read >> > ahead >> > > > > cache >> > > > > > > and >> > > > > > > >> > pending commit records cache, it would be good for both >> > short >> > > > > > > >> transactions >> > > > > > > >> > and long transactions. >> > > > > > > >> > >> > > > > > > >> > - DLSN, SequenceId: >> > > > > > > >> > >> > > > > > > >> > * We will add a fourth field to DLSN. It is `local >> sequence >> > > > > number` >> > > > > > > >> within >> > > > > > > >> > a transaction session. So the new DLSN of records in a >> > > > transaction >> > > > > > > will >> > > > > > > >> be >> > > > > > > >> > the DLSN of commit control record plus its local sequence >> > > > number. >> > > > > > > >> > * The sequence id will be still the position of the >> commit >> > > > record >> > > > > > plus >> > > > > > > >> its >> > > > > > > >> > local sequence number. The position will be advanced with >> > > total >> > > > > > number >> > > > > > > >> of >> > > > > > > >> > written records on writing the commit control record. >> > > > > > > >> > >> > > > > > > >> > - Transaction Group & Namespace Transaction >> > > > > > > >> > >> > > > > > > >> > using one single log stream for namespace transaction can >> > > cause >> > > > > the >> > > > > > > >> > bottleneck problem since all the begin/commit/end control >> > > > records >> > > > > > will >> > > > > > > >> have >> > > > > > > >> > to go through one log stream. >> > > > > > > >> > >> > > > > > > >> > the idea of 'transaction group' is to allow partitioning >> the >> > > > > writers >> > > > > > > >> into >> > > > > > > >> > different transaction groups. >> > > > > > > >> > >> > > > > > > >> > clients can specify the `group-name` when starting the >> > > > > transaction. >> > > > > > if >> > > > > > > >> > there is no `group-name` specified, it will use the >> default >> > > > > `commit` >> > > > > > > >> log in >> > > > > > > >> > the namespace for creating transactions. >> > > > > > > >> > >> > > > > > > >> > ------------------------------------------------- >> > > > > > > >> > >> > > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate >> any >> > > > > comments >> > > > > > > and >> > > > > > > >> if >> > > > > > > >> > anyone is also interested in this idea, we'd like to >> > > collaborate >> > > > > > with >> > > > > > > >> the >> > > > > > > >> > community. >> > > > > > > >> > >> > > > > > > >> > >> > > > > > > >> > - Xi >> > > > > > > >> > >> > > > > > > >> >> > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > >> > > >> > >> > >
