Xi, I added more comments. We are looking forward to your reply and seeing
this happen.

- Sijie

On Tue, Jan 3, 2017 at 11:18 PM, Sijie Guo <si...@apache.org> wrote:

> Sorry for late response. I think Leigh and you already had some very
> valuable discussions in the doc. I will try to add some of my questions to
> the discussion.
>
> Beside that, I had a discussion with Leigh today about this. first of all,
> I think it is very good to add transaction support in distributedlog. It is
> one of the primitives that would help building distributed service. But we
> have a concern about making this system become complicated and introduce
> operational overhead when it runs in the large scale system on production.
> There are two major suggestions that I have for this feature -
>
> Build the 'minimum' logic in core - I think the minimum logic that need to
> be added to the core is -  the special control records (begin, commit and
> abort) and make the reader be able to detect those special control records
> and know what do they mean and how to interrupt with them. Since they are
> special control records, there is not overhead to other readers that
> doesn't require this feature.
>
> Build the transaction coordinator as a separated proxy service  - I think
> the major concern that we have is putting more complexities into the 'write
> proxy' service. We architected distributedlog in a more microservice-like
> way - we have the core as the stream store, the proxy for serving write and
> read traffic. It would be good that the transaction feature can be done in
> a similar way. So the architecture would be like this -
>
> *[ write service ] [ read service ] [ transaction coordinator ]*
> *[ stream store
>               ]*
>
> if people doesn't need the transaction feature, they can turn if off
> completely without any operational overhead.
>
> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?
>
>
> Thanks,
> Sijie
>
>
>
>
>
> On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi.liu....@gmail.com> wrote:
>
>> Ping?
>>
>> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi.liu....@gmail.com> wrote:
>>
>> > Sijie,
>> >
>> > No. I thought it might be easier for people to comment on a google doc
>> to
>> > gather the initial feedback. I will put the content back to wiki page
>> once
>> > addressing the comments. Does that sound good to you?
>> >
>> > And thank you in advance.
>> >
>> > - Xi
>> >
>> >
>> >
>> > On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
>> >
>> >> Hi Xi,
>> >>
>> >> sorry for late response. I will review it soon.
>> >>
>> >> regarding this, a separate question "are we going to use google doc
>> >> instead
>> >> of email thread for any discussion"? I am a bit worried that the
>> >> discussion
>> >> will become lost after moving to google doc. No idea on how other
>> apache
>> >> projects are doing.
>> >>
>> >> - Sijie
>> >>
>> >> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi.liu....@gmail.com> wrote:
>> >>
>> >> > Hi all,
>> >> >
>> >> > I finalized the first version of the design. This time I used a
>> google
>> >> doc
>> >> > so that it is easier for commenting and add a link the wiki page. I
>> will
>> >> > update this to the wiki page once we come to the finalized design.
>> >> >
>> >> > https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
>> >> > bSIGgSzXuTI5BA/edit
>> >> >
>> >> > Let me know if you have any questions. Appreciate your reviews!
>> >> >
>> >> > - Xi
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
>> >> > <lstew...@twitter.com.invalid
>> >> > > wrote:
>> >> >
>> >> > > Interesting proposal. A couple quick notes while you continue to
>> flesh
>> >> > this
>> >> > > out.
>> >> > >
>> >> > > a. just to be sure - does this eliminate the need to save seqno
>> with
>> >> > > checkpoint?
>> >> > >
>> >> > > b. i.e. another way to describe this kind of improvement is
>> "support
>> >> > > records (atomic writes) larger than 1MB", iiuc. the advantage
>> being it
>> >> > > avoids the baggage of transactions. disadvantages include inability
>> >> to do
>> >> > > cross stream transactions, and flexibility (interleaving, etc) (are
>> >> there
>> >> > > others?).
>> >> > >
>> >> > > c. proxy use case is for supporting multiple writers - have you
>> >> thought
>> >> > > about how this would work with multiple writers?
>> >> > >
>> >> > > Thanks!
>> >> > >
>> >> > >
>> >> > > On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
>> <sij...@twitter.com.invalid
>> >> >
>> >> > > wrote:
>> >> > >
>> >> > > > Sound good to me. look forward to the detailed proposal.
>> >> > > >
>> >> > > > (I don't mind the format if it makes things easier to you)
>> >> > > >
>> >> > > > Sijie
>> >> > > >
>> >> > > > On Friday, October 14, 2016, Xi Liu <xi.liu....@gmail.com>
>> wrote:
>> >> > > >
>> >> > > > > Thank you, Sijie
>> >> > > > >
>> >> > > > > We have some internal discussions to sort out some details. We
>> are
>> >> > > ready
>> >> > > > to
>> >> > > > > collaborate with the community for adding the transaction
>> support
>> >> in
>> >> > > DL.
>> >> > > > > We'd like to share more.
>> >> > > > >
>> >> > > > > I created a proposal wiki here -
>> >> > > > > https://cwiki.apache.org/confluence/display/DL/DP-1+-+
>> >> > > > > DistributedLog+Transaction+Support
>> >> > > > >
>> >> > > > > (I followed KIP format and named it as DP (DistributedLog
>> >> Proposal -
>> >> > DP
>> >> > > > is
>> >> > > > > also short for Dynamic Programming). I don't know if you guys
>> like
>> >> > this
>> >> > > > > name or not. Feel free to change it :D)
>> >> > > > >
>> >> > > > > I basically put my initial email as the content there so far.
>> >> Once we
>> >> > > > > finished our final discussion, I will update with more
>> details. At
>> >> > the
>> >> > > > same
>> >> > > > > time, any comments are welcome.
>> >> > > > >
>> >> > > > > - Xi
>> >> > > > >
>> >> > > > >
>> >> > > > >
>> >> > > > > On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <si...@apache.org
>> >> > > > <javascript:;>>
>> >> > > > > wrote:
>> >> > > > >
>> >> > > > > > Xi,
>> >> > > > > >
>> >> > > > > > I just granted you the edit permission.
>> >> > > > > >
>> >> > > > > > - Sijie
>> >> > > > > >
>> >> > > > > > On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <
>> xi.liu....@gmail.com
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > >
>> >> > > > > > > I still can not edit the wiki. Can any of the pmc members
>> >> grant
>> >> > me
>> >> > > > the
>> >> > > > > > > permissions?
>> >> > > > > > >
>> >> > > > > > > - Xi
>> >> > > > > > >
>> >> > > > > > > On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
>> >> xi.liu....@gmail.com
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > > >
>> >> > > > > > > > Sijie,
>> >> > > > > > > >
>> >> > > > > > > > I attempted to create a wiki page under that space. I
>> found
>> >> > that
>> >> > > I
>> >> > > > am
>> >> > > > > > not
>> >> > > > > > > > authorized with edit permission.
>> >> > > > > > > >
>> >> > > > > > > > Can any of the committers grant me the wiki edit
>> >> permission? My
>> >> > > > > account
>> >> > > > > > > is
>> >> > > > > > > > "xi.liu.ant".
>> >> > > > > > > >
>> >> > > > > > > > - Xi
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
>> >> si...@apache.org
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > > > >
>> >> > > > > > > >> This sounds interesting ... I will take a closer look
>> and
>> >> give
>> >> > > my
>> >> > > > > > > comments
>> >> > > > > > > >> later.
>> >> > > > > > > >>
>> >> > > > > > > >> At the same time, do you mind creating a wiki page to
>> put
>> >> your
>> >> > > > idea
>> >> > > > > > > there?
>> >> > > > > > > >> You can add your wiki page under
>> >> > > > > > > >> https://cwiki.apache.org/confluence/display/DL/Project+
>> >> > > Proposals
>> >> > > > > > > >>
>> >> > > > > > > >> You might need to ask in the dev list to grant the wiki
>> >> edit
>> >> > > > > > permissions
>> >> > > > > > > >> to
>> >> > > > > > > >> you once you have a wiki account.
>> >> > > > > > > >>
>> >> > > > > > > >> - Sijie
>> >> > > > > > > >>
>> >> > > > > > > >>
>> >> > > > > > > >> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
>> >> xi.liu....@gmail.com
>> >> > > > > <javascript:;>> wrote:
>> >> > > > > > > >>
>> >> > > > > > > >> > Hello,
>> >> > > > > > > >> >
>> >> > > > > > > >> > I asked the transaction support in distributedlog user
>> >> group
>> >> > > two
>> >> > > > > > > months
>> >> > > > > > > >> > ago. I want to raise this up again, as we are looking
>> for
>> >> > > using
>> >> > > > > > > >> > distributedlog for building a transactional data
>> >> service. It
>> >> > > is
>> >> > > > a
>> >> > > > > > > major
>> >> > > > > > > >> > feature that is missing in distributedlog. We have
>> some
>> >> > ideas
>> >> > > to
>> >> > > > > add
>> >> > > > > > > >> this
>> >> > > > > > > >> > to distributedlog and want to know if they make sense
>> or
>> >> > not.
>> >> > > If
>> >> > > > > > they
>> >> > > > > > > >> are
>> >> > > > > > > >> > good, we'd like to contribute and develop with the
>> >> > community.
>> >> > > > > > > >> >
>> >> > > > > > > >> > Here are the thoughts:
>> >> > > > > > > >> >
>> >> > > > > > > >> > -------------------------------------------------
>> >> > > > > > > >> >
>> >> > > > > > > >> > From our understanding, DL can provide "at-least-once"
>> >> > > delivery
>> >> > > > > > > semantic
>> >> > > > > > > >> > (if not, please correct me) but not "exactly-once"
>> >> delivery
>> >> > > > > > semantic.
>> >> > > > > > > >> That
>> >> > > > > > > >> > means that a message can be delivered one or more
>> times
>> >> if
>> >> > the
>> >> > > > > > reader
>> >> > > > > > > >> > doesn't handle duplicates.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The duplicates come from two places, one is at writer
>> >> side
>> >> > > (this
>> >> > > > > > > assumes
>> >> > > > > > > >> > using write proxy not the core library), while the
>> other
>> >> one
>> >> > > is
>> >> > > > at
>> >> > > > > > > >> reader
>> >> > > > > > > >> > side.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - writer side: if the client attempts to write a
>> record
>> >> to
>> >> > the
>> >> > > > > write
>> >> > > > > > > >> > proxies and gets a network error (e.g timeouts) then
>> >> > retries,
>> >> > > > the
>> >> > > > > > > >> retrying
>> >> > > > > > > >> > will potentially result in duplicates.
>> >> > > > > > > >> > - reader side:if the reader reads a message from a
>> stream
>> >> > and
>> >> > > > then
>> >> > > > > > > >> crashes,
>> >> > > > > > > >> > when the reader restarts it would restart from last
>> known
>> >> > > > position
>> >> > > > > > > >> (DLSN).
>> >> > > > > > > >> > If the reader fails after processing a record and
>> before
>> >> > > > recording
>> >> > > > > > the
>> >> > > > > > > >> > position, the processed record will be delivered
>> again.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The reader problem can be properly addressed by making
>> >> use
>> >> > of
>> >> > > > the
>> >> > > > > > > >> sequence
>> >> > > > > > > >> > numbers of records and doing proper checkpointing. For
>> >> > > example,
>> >> > > > in
>> >> > > > > > > >> > database, it can checkpoint the indexed data with the
>> >> > sequence
>> >> > > > > > number
>> >> > > > > > > of
>> >> > > > > > > >> > records; in flink, it can checkpoint the state with
>> the
>> >> > > sequence
>> >> > > > > > > >> numbers.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The writer problem can be addressed by implementing an
>> >> > > > idempotent
>> >> > > > > > > >> writer.
>> >> > > > > > > >> > However, an alternative and more powerful approach is
>> to
>> >> > > support
>> >> > > > > > > >> > transactions.
>> >> > > > > > > >> >
>> >> > > > > > > >> > *What does transaction mean?*
>> >> > > > > > > >> >
>> >> > > > > > > >> > A transaction means a collection of records can be
>> >> written
>> >> > > > > > > >> transactionally
>> >> > > > > > > >> > within a stream or across multiple streams. They will
>> be
>> >> > > > consumed
>> >> > > > > by
>> >> > > > > > > the
>> >> > > > > > > >> > reader together when a transaction is committed, or
>> will
>> >> > never
>> >> > > > be
>> >> > > > > > > >> consumed
>> >> > > > > > > >> > by the reader when the transaction is aborted.
>> >> > > > > > > >> >
>> >> > > > > > > >> > The transaction will expose following guarantees:
>> >> > > > > > > >> >
>> >> > > > > > > >> > - The reader should not be exposed to records written
>> >> from
>> >> > > > > > uncommitted
>> >> > > > > > > >> > transactions (mandatory)
>> >> > > > > > > >> > - The reader should consume the records in the
>> >> transaction
>> >> > > > commit
>> >> > > > > > > order
>> >> > > > > > > >> > rather than the record written order (mandatory)
>> >> > > > > > > >> > - No duplicated records within a transaction
>> (mandatory)
>> >> > > > > > > >> > - Allow interleaving transactional writes and
>> >> > > non-transactional
>> >> > > > > > writes
>> >> > > > > > > >> > (optional)
>> >> > > > > > > >> >
>> >> > > > > > > >> > *Stream Transaction & Namespace Transaction*
>> >> > > > > > > >> >
>> >> > > > > > > >> > There will be two types of transaction, one is Stream
>> >> level
>> >> > > > > > > transaction
>> >> > > > > > > >> > (local transaction), while the other one is Namespace
>> >> level
>> >> > > > > > > transaction
>> >> > > > > > > >> > (global transaction).
>> >> > > > > > > >> >
>> >> > > > > > > >> > The stream level transaction is a transactional
>> >> operation on
>> >> > > > > writing
>> >> > > > > > > >> > records to one stream; the namespace level transaction
>> >> is a
>> >> > > > > > > >> transactional
>> >> > > > > > > >> > operation on writing records to multiple streams.
>> >> > > > > > > >> >
>> >> > > > > > > >> > *Implementation Thoughts*
>> >> > > > > > > >> >
>> >> > > > > > > >> > - A transaction is consist of begin control record, a
>> >> series
>> >> > > of
>> >> > > > > data
>> >> > > > > > > >> > records and commit/abort control record.
>> >> > > > > > > >> > - The begin/commit/abort control record is written to
>> a
>> >> > > `commit`
>> >> > > > > log
>> >> > > > > > > >> > stream, while the data records will be written to
>> normal
>> >> > data
>> >> > > > log
>> >> > > > > > > >> streams.
>> >> > > > > > > >> > - The `commit` log stream will be the same log stream
>> for
>> >> > > > > > stream-level
>> >> > > > > > > >> > transaction,  while it will be a *system* stream (or
>> >> > multiple
>> >> > > > > system
>> >> > > > > > > >> > streams) for namespace-level transactions.
>> >> > > > > > > >> > - The transaction code looks like as below:
>> >> > > > > > > >> >
>> >> > > > > > > >> > <code>
>> >> > > > > > > >> >
>> >> > > > > > > >> > Transaction txn = client.transaction();
>> >> > > > > > > >> > Future<DLSN> result1 = txn.write(stream-0, record);
>> >> > > > > > > >> > Future<DLSN> result2 = txn.write(stream-1, record);
>> >> > > > > > > >> > Future<DLSN> result3 = txn.write(stream-2, record);
>> >> > > > > > > >> > Future<Pair<DLSN, DLSN>> result = txn.commit();
>> >> > > > > > > >> >
>> >> > > > > > > >> > </code>
>> >> > > > > > > >> >
>> >> > > > > > > >> > if the txn is committed, all the write futures will be
>> >> > > satisfied
>> >> > > > > > with
>> >> > > > > > > >> their
>> >> > > > > > > >> > written DLSNs. if the txn is aborted, all the write
>> >> futures
>> >> > > will
>> >> > > > > be
>> >> > > > > > > >> failed
>> >> > > > > > > >> > together. there is no partial failure state.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - The actually data flow will be:
>> >> > > > > > > >> >
>> >> > > > > > > >> > 1. writer get a transaction id from the owner of the
>> >> > `commit'
>> >> > > > log
>> >> > > > > > > stream
>> >> > > > > > > >> > 1. write the begin control record (synchronously) with
>> >> the
>> >> > > > > > transaction
>> >> > > > > > > >> id
>> >> > > > > > > >> > 2. for each write within the same txn, it will be
>> >> assigned a
>> >> > > > local
>> >> > > > > > > >> sequence
>> >> > > > > > > >> > number starting from 0. the combination of
>> transaction id
>> >> > and
>> >> > > > > local
>> >> > > > > > > >> > sequence number will be used later on by the readers
>> to
>> >> > > > > de-duplicate
>> >> > > > > > > >> > records.
>> >> > > > > > > >> > 3. the commit/abort control record will be written
>> based
>> >> on
>> >> > > the
>> >> > > > > > > results
>> >> > > > > > > >> > from 2.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Application can supply a timeout for the transaction
>> >> when
>> >> > > > > > #begin() a
>> >> > > > > > > >> > transaction. The owner of the `commit` log stream can
>> >> abort
>> >> > > > > > > transactions
>> >> > > > > > > >> > that never be committed/aborted within their timeout.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Failures:
>> >> > > > > > > >> >
>> >> > > > > > > >> > * all the log records can be simply retried as they
>> will
>> >> be
>> >> > > > > > > >> de-duplicated
>> >> > > > > > > >> > probably at the reader side.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Reader:
>> >> > > > > > > >> >
>> >> > > > > > > >> > * Reader can be configured to read uncommitted
>> records or
>> >> > > > > committed
>> >> > > > > > > >> records
>> >> > > > > > > >> > only (by default read uncommitted records)
>> >> > > > > > > >> > * If reader is configured to read committed records
>> only,
>> >> > the
>> >> > > > read
>> >> > > > > > > ahead
>> >> > > > > > > >> > cache will be changed to maintain one additional
>> pending
>> >> > > > committed
>> >> > > > > > > >> records.
>> >> > > > > > > >> > the pending committed records map is bounded and
>> records
>> >> > will
>> >> > > be
>> >> > > > > > > dropped
>> >> > > > > > > >> > when read ahead is moving.
>> >> > > > > > > >> > * when the reader hits a commit record, it will
>> rewind to
>> >> > the
>> >> > > > > begin
>> >> > > > > > > >> record
>> >> > > > > > > >> > and start reading from there. leveraging the proper
>> read
>> >> > ahead
>> >> > > > > cache
>> >> > > > > > > and
>> >> > > > > > > >> > pending commit records cache, it would be good for
>> both
>> >> > short
>> >> > > > > > > >> transactions
>> >> > > > > > > >> > and long transactions.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - DLSN, SequenceId:
>> >> > > > > > > >> >
>> >> > > > > > > >> > * We will add a fourth field to DLSN. It is `local
>> >> sequence
>> >> > > > > number`
>> >> > > > > > > >> within
>> >> > > > > > > >> > a transaction session. So the new DLSN of records in a
>> >> > > > transaction
>> >> > > > > > > will
>> >> > > > > > > >> be
>> >> > > > > > > >> > the DLSN of commit control record plus its local
>> sequence
>> >> > > > number.
>> >> > > > > > > >> > * The sequence id will be still the position of the
>> >> commit
>> >> > > > record
>> >> > > > > > plus
>> >> > > > > > > >> its
>> >> > > > > > > >> > local sequence number. The position will be advanced
>> with
>> >> > > total
>> >> > > > > > number
>> >> > > > > > > >> of
>> >> > > > > > > >> > written records on writing the commit control record.
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Transaction Group & Namespace Transaction
>> >> > > > > > > >> >
>> >> > > > > > > >> > using one single log stream for namespace transaction
>> can
>> >> > > cause
>> >> > > > > the
>> >> > > > > > > >> > bottleneck problem since all the begin/commit/end
>> control
>> >> > > > records
>> >> > > > > > will
>> >> > > > > > > >> have
>> >> > > > > > > >> > to go through one log stream.
>> >> > > > > > > >> >
>> >> > > > > > > >> > the idea of 'transaction group' is to allow
>> partitioning
>> >> the
>> >> > > > > writers
>> >> > > > > > > >> into
>> >> > > > > > > >> > different transaction groups.
>> >> > > > > > > >> >
>> >> > > > > > > >> > clients can specify the `group-name` when starting the
>> >> > > > > transaction.
>> >> > > > > > if
>> >> > > > > > > >> > there is no `group-name` specified, it will use the
>> >> default
>> >> > > > > `commit`
>> >> > > > > > > >> log in
>> >> > > > > > > >> > the namespace for creating transactions.
>> >> > > > > > > >> >
>> >> > > > > > > >> > -------------------------------------------------
>> >> > > > > > > >> >
>> >> > > > > > > >> > I'd like to collect feedbacks on this idea. Appreciate
>> >> any
>> >> > > > > comments
>> >> > > > > > > and
>> >> > > > > > > >> if
>> >> > > > > > > >> > anyone is also interested in this idea, we'd like to
>> >> > > collaborate
>> >> > > > > > with
>> >> > > > > > > >> the
>> >> > > > > > > >> > community.
>> >> > > > > > > >> >
>> >> > > > > > > >> >
>> >> > > > > > > >> > - Xi
>> >> > > > > > > >> >
>> >> > > > > > > >>
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > >
>> >> > > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>> >
>>
>
>

Reply via email to