The developers upload the design doc onto JIRA, at least for the HADOOP/HBase/Cassandra/... projects.
On Mon, Dec 19, 2016 at 12:48 AM, Sijie Guo <[email protected]> wrote:

Hi Xi,

Sorry for the late response. I will review it soon.

Regarding this, a separate question: are we going to use a Google doc instead of an email thread for any discussion? I am a bit worried that the discussion will become lost after moving to a Google doc. No idea how other Apache projects are doing this.

- Sijie

On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <[email protected]> wrote:

Hi all,

I finalized the first version of the design. This time I used a Google doc so that it is easier to comment on, and added a link to the wiki page. I will update the wiki page once we come to the finalized design.

https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeKbSIGgSzXuTI5BA/edit

Let me know if you have any questions. Appreciate your reviews!

- Xi

On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart <[email protected]> wrote:

Interesting proposal. A couple of quick notes while you continue to flesh this out.

a. Just to be sure - does this eliminate the need to save the seqno with the checkpoint?

b. i.e. another way to describe this kind of improvement is "support records (atomic writes) larger than 1MB", iiuc. The advantage being that it avoids the baggage of transactions. Disadvantages include the inability to do cross-stream transactions, and flexibility (interleaving, etc.) (are there others?).

c. The proxy use case is for supporting multiple writers - have you thought about how this would work with multiple writers?

Thanks!

On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <[email protected]> wrote:

Sounds good to me. Look forward to the detailed proposal.

(I don't mind the format if it makes things easier for you.)

Sijie

On Friday, October 14, 2016, Xi Liu <[email protected]> wrote:

Thank you, Sijie.

We had some internal discussions to sort out some details. We are ready to collaborate with the community on adding transaction support to DL. We'd like to share more.

I created a proposal wiki here -
https://cwiki.apache.org/confluence/display/DL/DP-1+-+DistributedLog+Transaction+Support

(I followed the KIP format and named it DP (DistributedLog Proposal - DP is also short for Dynamic Programming). I don't know if you guys like this name or not. Feel free to change it :D)

I basically put my initial email there as the content so far. Once we finish our final discussion, I will update it with more details. In the meantime, any comments are welcome.

- Xi

On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <[email protected]> wrote:

Xi,

I just granted you the edit permission.

- Sijie

On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <[email protected]> wrote:

I still cannot edit the wiki. Can any of the PMC members grant me the permissions?
- Xi

On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <[email protected]> wrote:

Sijie,

I attempted to create a wiki page under that space and found that I am not authorized with edit permission.

Can any of the committers grant me the wiki edit permission? My account is "xi.liu.ant".

- Xi

On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <[email protected]> wrote:

This sounds interesting ... I will take a closer look and give my comments later.

In the meantime, do you mind creating a wiki page to put your idea there? You can add your wiki page under
https://cwiki.apache.org/confluence/display/DL/Project+Proposals

You might need to ask on the dev list to be granted the wiki edit permissions once you have a wiki account.

- Sijie

On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <[email protected]> wrote:

Hello,

I asked about transaction support in the distributedlog user group two months ago. I want to raise this up again, as we are looking at using distributedlog for building a transactional data service. It is a major feature that is missing in distributedlog. We have some ideas for adding this to distributedlog and want to know if they make sense or not. If they are good, we'd like to contribute and develop them with the community.

Here are the thoughts:

-------------------------------------------------

From our understanding, DL can provide "at-least-once" delivery semantics (if not, please correct me) but not "exactly-once" delivery semantics. That means a message can be delivered one or more times if the reader doesn't handle duplicates.

The duplicates come from two places: one is the writer side (this assumes using the write proxy, not the core library), while the other is the reader side.

- writer side: if the client attempts to write a record to the write proxies, gets a network error (e.g. timeouts) and then retries, the retrying will potentially result in duplicates.
- reader side: if the reader reads a message from a stream and then crashes, when the reader restarts it will restart from the last known position (DLSN). If the reader fails after processing a record and before recording the position, the processed record will be delivered again.

The reader problem can be properly addressed by making use of the sequence numbers of records and doing proper checkpointing. For example, a database can checkpoint the indexed data with the sequence number of the records; Flink can checkpoint its state with the sequence numbers.
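A minimal sketch of that checkpointing idea, assuming a hypothetical state store that persists the application state and the last applied sequence number together (none of this is DistributedLog API; names are invented for illustration):

<code>

public class CheckpointingReader {

    // Sequence number of the last record applied, recovered from the checkpoint.
    private long lastAppliedSeq;
    // Hypothetical store that persists state and sequence number atomically.
    private final StateStore store;

    public CheckpointingReader(StateStore store) {
        this.store = store;
        this.lastAppliedSeq = store.loadCheckpointedSequence();
    }

    public void onRecord(long sequenceId, byte[] payload) {
        if (sequenceId <= lastAppliedSeq) {
            return; // duplicate delivery after a crash/restart; already applied
        }
        store.apply(payload, sequenceId); // apply and checkpoint in one step
        lastAppliedSeq = sequenceId;
    }

    interface StateStore {
        long loadCheckpointedSequence();
        void apply(byte[] payload, long sequenceId);
    }
}

</code>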
The writer problem can be addressed by implementing an idempotent writer. However, an alternative and more powerful approach is to support transactions.

*What does transaction mean?*

A transaction means a collection of records can be written transactionally within a stream or across multiple streams. They will be consumed by the reader together when the transaction is committed, or never consumed by the reader when the transaction is aborted.

The transaction will expose the following guarantees:

- The reader should not be exposed to records written from uncommitted transactions (mandatory)
- The reader should consume the records in transaction commit order rather than record written order (mandatory)
- No duplicated records within a transaction (mandatory)
- Allow interleaving transactional writes and non-transactional writes (optional)

*Stream Transaction & Namespace Transaction*

There will be two types of transactions: one is the stream-level transaction (local transaction), while the other is the namespace-level transaction (global transaction).

The stream-level transaction is a transactional operation on writing records to one stream; the namespace-level transaction is a transactional operation on writing records to multiple streams.

*Implementation Thoughts*

- A transaction consists of a begin control record, a series of data records and a commit/abort control record.
- The begin/commit/abort control records are written to a `commit` log stream, while the data records are written to the normal data log streams.
- The `commit` log stream will be the same log stream for a stream-level transaction, while it will be a *system* stream (or multiple system streams) for namespace-level transactions.
- The transaction code looks like below:

<code>

Transaction txn = client.transaction();
Future<DLSN> result1 = txn.write("stream-0", record);
Future<DLSN> result2 = txn.write("stream-1", record);
Future<DLSN> result3 = txn.write("stream-2", record);
Future<Pair<DLSN, DLSN>> result = txn.commit();

</code>

If the txn is committed, all the write futures will be satisfied with their written DLSNs. If the txn is aborted, all the write futures will be failed together. There is no partial failure state.

- The actual data flow will be:

1. the writer gets a transaction id from the owner of the `commit` log stream
2. it writes the begin control record (synchronously) with the transaction id
3. each write within the same txn is assigned a local sequence number starting from 0. The combination of transaction id and local sequence number will be used later by the readers to de-duplicate records (see the sketch below).
4. the commit/abort control record is written based on the results from 3.
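A minimal sketch of the reader-side de-duplication mentioned in step 3: the reader remembers which (transaction id, local sequence number) pairs it has already seen for each open transaction and drops retried writes. This is not DistributedLog code; class and method names are invented for illustration.

<code>

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TxnDeduplicator {

    // Local sequence numbers already seen for each open transaction.
    private final Map<Long, Set<Long>> seenByTxn = new HashMap<>();

    /** Returns true if this (txnId, localSeqNo) pair was already delivered. */
    public boolean isDuplicate(long txnId, long localSeqNo) {
        // add() returns false when the local sequence number was seen before,
        // which is exactly the retried-write case described above.
        return !seenByTxn.computeIfAbsent(txnId, id -> new HashSet<>()).add(localSeqNo);
    }

    /** Called once the commit/abort control record for the transaction is read. */
    public void onTransactionClosed(long txnId) {
        seenByTxn.remove(txnId);
    }
}

</code>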
- The application can supply a timeout for the transaction when it #begin()s a transaction. The owner of the `commit` log stream can abort transactions that are never committed/aborted within their timeout.

- Failures:

* all the log records can simply be retried, as they will be de-duplicated (probably at the reader side).

- Reader:

* The reader can be configured to read uncommitted records or committed records only (by default it reads uncommitted records).
* If the reader is configured to read committed records only, the read-ahead cache will be changed to maintain an additional map of pending committed records. The pending committed records map is bounded, and records will be dropped as the read-ahead moves on.
* When the reader hits a commit record, it will rewind to the begin record and start reading from there. By leveraging the read-ahead cache and the pending commit records cache properly, this works for both short transactions and long transactions.

- DLSN, SequenceId:

* We will add a fourth field to DLSN: the `local sequence number` within a transaction session. The new DLSN of a record in a transaction will be the DLSN of the commit control record plus the record's local sequence number.
* The sequence id will still be the position of the commit record plus the record's local sequence number. The position will be advanced by the total number of written records when the commit control record is written.
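A small worked example of that arithmetic, with made-up numbers (the DLSN field layout and positions below are only illustrative, not DistributedLog code):

<code>

public class TxnSequenceIdExample {

    public static void main(String[] args) {
        // Suppose the commit control record is written at DLSN (logSegment=12, entry=4097, slot=0)
        // and at stream position 100, closing a transaction that wrote 3 records.
        long commitPosition = 100;
        int recordsInTxn = 3;

        for (int localSeq = 0; localSeq < recordsInTxn; localSeq++) {
            // Each record's DLSN is the commit record's DLSN extended with its local sequence
            // number, and its sequence id is the commit position plus that local sequence number.
            System.out.println("DLSN(12, 4097, 0, " + localSeq
                    + ") -> sequence id " + (commitPosition + localSeq)); // 100, 101, 102
        }

        // Writing the commit control record advances the position by the record count: 100 -> 103.
        System.out.println("next position: " + (commitPosition + recordsInTxn));
    }
}

</code>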
- Transaction Group & Namespace Transaction

Using one single log stream for namespace transactions can become a bottleneck, since all the begin/commit/end control records have to go through one log stream.

The idea of a 'transaction group' is to allow partitioning the writers into different transaction groups.

Clients can specify a `group-name` when starting the transaction. If no `group-name` is specified, the default `commit` log in the namespace is used for creating transactions.

-------------------------------------------------

I'd like to collect feedback on this idea. Appreciate any comments, and if anyone is also interested in this idea, we'd like to collaborate with the community.

- Xi
