The developers upload the design doc onto JIRA, at least for the HADOOP/HBase/Cassandra/... projects.
On Mon, Dec 19, 2016 at 12:48 AM, Sijie Guo <[email protected]> wrote:

Hi Xi,

Sorry for the late response. I will review it soon.

Regarding this, a separate question: are we going to use a Google doc instead of an email thread for any discussion? I am a bit worried that the discussion will become lost after moving to a Google doc. No idea how other Apache projects are doing this.

- Sijie

On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <[email protected]> wrote:

Hi all,

I finalized the first version of the design. This time I used a Google doc so that it is easier to comment on, and added a link to the wiki page. I will update the wiki page once we come to the finalized design.

https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeKbSIGgSzXuTI5BA/edit

Let me know if you have any questions. Appreciate your reviews!

- Xi

On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart <[email protected]> wrote:

Interesting proposal. A couple of quick notes while you continue to flesh this out.

a. Just to be sure - does this eliminate the need to save the seqno with the checkpoint?

b. i.e. another way to describe this kind of improvement is "support records (atomic writes) larger than 1MB", iiuc. The advantage being that it avoids the baggage of transactions. Disadvantages include the inability to do cross-stream transactions, and flexibility (interleaving, etc.) (are there others?).

c. The proxy use case is for supporting multiple writers - have you thought about how this would work with multiple writers?

Thanks!

On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo <[email protected]> wrote:

Sounds good to me. Look forward to the detailed proposal.

(I don't mind the format if it makes things easier for you.)

Sijie

On Friday, October 14, 2016, Xi Liu <[email protected]> wrote:

Thank you, Sijie.

We had some internal discussions to sort out some details. We are ready to collaborate with the community on adding transaction support to DL. We'd like to share more.

I created a proposal wiki here -
https://cwiki.apache.org/confluence/display/DL/DP-1+-+DistributedLog+Transaction+Support

(I followed the KIP format and named it DP (DistributedLog Proposal - DP is also short for Dynamic Programming). I don't know if you guys like this name or not. Feel free to change it :D)

I basically put my initial email there as the content so far. Once we finish our final discussion, I will update it with more details. In the meantime, any comments are welcome.

- Xi

On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <[email protected]> wrote:

Xi,

I just granted you the edit permission.

- Sijie

On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <[email protected]> wrote:

I still cannot edit the wiki. Can any of the PMC members grant me the permissions?
- Xi

On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <[email protected]> wrote:

Sijie,

I attempted to create a wiki page under that space and found that I am not authorized with edit permission.

Can any of the committers grant me the wiki edit permission? My account is "xi.liu.ant".

- Xi

On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <[email protected]> wrote:

This sounds interesting ... I will take a closer look and give my comments later.

In the meantime, do you mind creating a wiki page to put your idea there? You can add your wiki page under
https://cwiki.apache.org/confluence/display/DL/Project+Proposals

You might need to ask on the dev list to be granted the wiki edit permissions once you have a wiki account.

- Sijie

On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <[email protected]> wrote:

Hello,

I asked about transaction support in the distributedlog user group two months ago. I want to raise this up again, as we are looking at using distributedlog for building a transactional data service. It is a major feature that is missing in distributedlog. We have some ideas for adding this to distributedlog and want to know if they make sense or not. If they are good, we'd like to contribute and develop them with the community.

Here are the thoughts:

-------------------------------------------------

From our understanding, DL can provide "at-least-once" delivery semantics (if not, please correct me) but not "exactly-once" delivery semantics. That means a message can be delivered one or more times if the reader doesn't handle duplicates.

The duplicates come from two places: one is the writer side (this assumes using the write proxy, not the core library), while the other is the reader side.

- writer side: if the client attempts to write a record to the write proxies, gets a network error (e.g. timeouts) and then retries, the retrying will potentially result in duplicates.
- reader side: if the reader reads a message from a stream and then crashes, when the reader restarts it will restart from the last known position (DLSN). If the reader fails after processing a record and before recording the position, the processed record will be delivered again.

The reader problem can be properly addressed by making use of the sequence numbers of records and doing proper checkpointing. For example, a database can checkpoint the indexed data with the sequence number of the records; Flink can checkpoint its state with the sequence numbers.
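A minimal sketch of that checkpointing idea, assuming a hypothetical state store that persists the application state and the last applied sequence number together (none of this is DistributedLog API; names are invented for illustration):

<code>

public class CheckpointingReader {

    // Sequence number of the last record applied, recovered from the checkpoint.
    private long lastAppliedSeq;
    // Hypothetical store that persists state and sequence number atomically.
    private final StateStore store;

    public CheckpointingReader(StateStore store) {
        this.store = store;
        this.lastAppliedSeq = store.loadCheckpointedSequence();
    }

    public void onRecord(long sequenceId, byte[] payload) {
        if (sequenceId <= lastAppliedSeq) {
            return; // duplicate delivery after a crash/restart; already applied
        }
        store.apply(payload, sequenceId); // apply and checkpoint in one step
        lastAppliedSeq = sequenceId;
    }

    interface StateStore {
        long loadCheckpointedSequence();
        void apply(byte[] payload, long sequenceId);
    }
}

</code>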
The writer problem can be addressed by implementing an idempotent writer. However, an alternative and more powerful approach is to support transactions.

*What does transaction mean?*

A transaction means a collection of records can be written transactionally within a stream or across multiple streams. They will be consumed by the reader together when the transaction is committed, or never consumed by the reader when the transaction is aborted.

The transaction will expose the following guarantees:

- The reader should not be exposed to records written from uncommitted transactions (mandatory)
- The reader should consume the records in transaction commit order rather than record written order (mandatory)
- No duplicated records within a transaction (mandatory)
- Allow interleaving transactional writes and non-transactional writes (optional)

*Stream Transaction & Namespace Transaction*

There will be two types of transactions: one is the stream-level transaction (local transaction), while the other is the namespace-level transaction (global transaction).

The stream-level transaction is a transactional operation on writing records to one stream; the namespace-level transaction is a transactional operation on writing records to multiple streams.

*Implementation Thoughts*

- A transaction consists of a begin control record, a series of data records and a commit/abort control record.
- The begin/commit/abort control records are written to a `commit` log stream, while the data records are written to the normal data log streams.
- The `commit` log stream will be the same log stream for a stream-level transaction, while it will be a *system* stream (or multiple system streams) for namespace-level transactions.
- The transaction code looks like below:

<code>

Transaction txn = client.transaction();
Future<DLSN> result1 = txn.write("stream-0", record);
Future<DLSN> result2 = txn.write("stream-1", record);
Future<DLSN> result3 = txn.write("stream-2", record);
Future<Pair<DLSN, DLSN>> result = txn.commit();

</code>

If the txn is committed, all the write futures will be satisfied with their written DLSNs. If the txn is aborted, all the write futures will be failed together. There is no partial failure state.

- The actual data flow will be:

1. the writer gets a transaction id from the owner of the `commit` log stream
2. it writes the begin control record (synchronously) with the transaction id
3. each write within the same txn is assigned a local sequence number starting from 0. The combination of transaction id and local sequence number will be used later by the readers to de-duplicate records (see the sketch below).
4. the commit/abort control record is written based on the results from 3.
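A minimal sketch of the reader-side de-duplication mentioned in step 3: the reader remembers which (transaction id, local sequence number) pairs it has already seen for each open transaction and drops retried writes. This is not DistributedLog code; class and method names are invented for illustration.

<code>

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class TxnDeduplicator {

    // Local sequence numbers already seen for each open transaction.
    private final Map<Long, Set<Long>> seenByTxn = new HashMap<>();

    /** Returns true if this (txnId, localSeqNo) pair was already delivered. */
    public boolean isDuplicate(long txnId, long localSeqNo) {
        // add() returns false when the local sequence number was seen before,
        // which is exactly the retried-write case described above.
        return !seenByTxn.computeIfAbsent(txnId, id -> new HashSet<>()).add(localSeqNo);
    }

    /** Called once the commit/abort control record for the transaction is read. */
    public void onTransactionClosed(long txnId) {
        seenByTxn.remove(txnId);
    }
}

</code>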
- The application can supply a timeout for the transaction when it #begin()s a transaction. The owner of the `commit` log stream can abort transactions that are never committed/aborted within their timeout.

- Failures:

* all the log records can simply be retried, as they will be de-duplicated (probably at the reader side).

- Reader:

* The reader can be configured to read uncommitted records or committed records only (by default it reads uncommitted records).
* If the reader is configured to read committed records only, the read-ahead cache will be changed to maintain an additional map of pending committed records. The pending committed records map is bounded, and records will be dropped as the read-ahead moves on.
* When the reader hits a commit record, it will rewind to the begin record and start reading from there. By leveraging the read-ahead cache and the pending commit records cache properly, this works for both short transactions and long transactions.

- DLSN, SequenceId:

* We will add a fourth field to DLSN: the `local sequence number` within a transaction session. The new DLSN of a record in a transaction will be the DLSN of the commit control record plus the record's local sequence number.
* The sequence id will still be the position of the commit record plus the record's local sequence number. The position will be advanced by the total number of written records when the commit control record is written.
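A small worked example of that arithmetic, with made-up numbers (the DLSN field layout and positions below are only illustrative, not DistributedLog code):

<code>

public class TxnSequenceIdExample {

    public static void main(String[] args) {
        // Suppose the commit control record is written at DLSN (logSegment=12, entry=4097, slot=0)
        // and at stream position 100, closing a transaction that wrote 3 records.
        long commitPosition = 100;
        int recordsInTxn = 3;

        for (int localSeq = 0; localSeq < recordsInTxn; localSeq++) {
            // Each record's DLSN is the commit record's DLSN extended with its local sequence
            // number, and its sequence id is the commit position plus that local sequence number.
            System.out.println("DLSN(12, 4097, 0, " + localSeq
                    + ") -> sequence id " + (commitPosition + localSeq)); // 100, 101, 102
        }

        // Writing the commit control record advances the position by the record count: 100 -> 103.
        System.out.println("next position: " + (commitPosition + recordsInTxn));
    }
}

</code>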
- Transaction Group & Namespace Transaction

Using one single log stream for namespace transactions can become a bottleneck, since all the begin/commit/end control records have to go through one log stream.

The idea of a 'transaction group' is to allow partitioning the writers into different transaction groups.

Clients can specify a `group-name` when starting the transaction. If no `group-name` is specified, the default `commit` log in the namespace is used for creating transactions.

-------------------------------------------------

I'd like to collect feedback on this idea. Appreciate any comments, and if anyone is also interested in this idea, we'd like to collaborate with the community.

- Xi
