> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?

The use case I would have for transactions - at some level of the stack - is 
supporting dynamic configurations.

If a config changes in e.g. three lines, some of the changes may logically 
belong together. E.g. changing both “host” and “port” (if separate entries). 
One shouldn’t be able to read a state, even temporarily, that has new host but 
old port.

I can do this in the application level - it does not need to be part of the DL 
protocol.


Asko Kauppi
Zalando Tech Helsinki

> On 4 Jan 2017, at 9.18, Sijie Guo <si...@apache.org> wrote:
> 
> Sorry for late response. I think Leigh and you already had some very
> valuable discussions in the doc. I will try to add some of my questions to
> the discussion.
> 
> Beside that, I had a discussion with Leigh today about this. first of all,
> I think it is very good to add transaction support in distributedlog. It is
> one of the primitives that would help building distributed service. But we
> have a concern about making this system become complicated and introduce
> operational overhead when it runs in the large scale system on production.
> There are two major suggestions that I have for this feature -
> 
> Build the 'minimum' logic in core - I think the minimum logic that need to
> be added to the core is -  the special control records (begin, commit and
> abort) and make the reader be able to detect those special control records
> and know what do they mean and how to interrupt with them. Since they are
> special control records, there is not overhead to other readers that
> doesn't require this feature.
> 
> Build the transaction coordinator as a separated proxy service  - I think
> the major concern that we have is putting more complexities into the 'write
> proxy' service. We architected distributedlog in a more microservice-like
> way - we have the core as the stream store, the proxy for serving write and
> read traffic. It would be good that the transaction feature can be done in
> a similar way. So the architecture would be like this -
> 
> *[ write service ] [ read service ] [ transaction coordinator ]*
> *[ stream store
>            ]*
> 
> if people doesn't need the transaction feature, they can turn if off
> completely without any operational overhead.
> 
> Beside that, I have one general question - What is the major goal for this
> feature? Are you targeting on building a general XA transaction coordinator
> or just for supporting things like `copy-modify-write' style workflow?
> 
> 
> Thanks,
> Sijie
> 
> 
> 
> 
> 
> On Wed, Dec 28, 2016 at 1:12 PM, Xi Liu <xi.liu....@gmail.com> wrote:
> 
>> Ping?
>> 
>> On Mon, Dec 19, 2016 at 8:28 AM, Xi Liu <xi.liu....@gmail.com> wrote:
>> 
>>> Sijie,
>>> 
>>> No. I thought it might be easier for people to comment on a google doc to
>>> gather the initial feedback. I will put the content back to wiki page
>> once
>>> addressing the comments. Does that sound good to you?
>>> 
>>> And thank you in advance.
>>> 
>>> - Xi
>>> 
>>> 
>>> 
>>> On Sun, Dec 18, 2016 at 8:48 AM, Sijie Guo <si...@apache.org> wrote:
>>> 
>>>> Hi Xi,
>>>> 
>>>> sorry for late response. I will review it soon.
>>>> 
>>>> regarding this, a separate question "are we going to use google doc
>>>> instead
>>>> of email thread for any discussion"? I am a bit worried that the
>>>> discussion
>>>> will become lost after moving to google doc. No idea on how other apache
>>>> projects are doing.
>>>> 
>>>> - Sijie
>>>> 
>>>> On Wed, Dec 14, 2016 at 11:41 PM, Xi Liu <xi.liu....@gmail.com> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> I finalized the first version of the design. This time I used a google
>>>> doc
>>>>> so that it is easier for commenting and add a link the wiki page. I
>> will
>>>>> update this to the wiki page once we come to the finalized design.
>>>>> 
>>>>> https://docs.google.com/document/d/14Ns05M8Z5a6DF6fHmWQwISyD5jjeK
>>>>> bSIGgSzXuTI5BA/edit
>>>>> 
>>>>> Let me know if you have any questions. Appreciate your reviews!
>>>>> 
>>>>> - Xi
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Fri, Oct 28, 2016 at 7:58 AM, Leigh Stewart
>>>>> <lstew...@twitter.com.invalid
>>>>>> wrote:
>>>>> 
>>>>>> Interesting proposal. A couple quick notes while you continue to
>> flesh
>>>>> this
>>>>>> out.
>>>>>> 
>>>>>> a. just to be sure - does this eliminate the need to save seqno with
>>>>>> checkpoint?
>>>>>> 
>>>>>> b. i.e. another way to describe this kind of improvement is "support
>>>>>> records (atomic writes) larger than 1MB", iiuc. the advantage being
>> it
>>>>>> avoids the baggage of transactions. disadvantages include inability
>>>> to do
>>>>>> cross stream transactions, and flexibility (interleaving, etc) (are
>>>> there
>>>>>> others?).
>>>>>> 
>>>>>> c. proxy use case is for supporting multiple writers - have you
>>>> thought
>>>>>> about how this would work with multiple writers?
>>>>>> 
>>>>>> Thanks!
>>>>>> 
>>>>>> 
>>>>>> On Tue, Oct 18, 2016 at 6:45 PM, Sijie Guo
>> <sij...@twitter.com.invalid
>>>>> 
>>>>>> wrote:
>>>>>> 
>>>>>>> Sound good to me. look forward to the detailed proposal.
>>>>>>> 
>>>>>>> (I don't mind the format if it makes things easier to you)
>>>>>>> 
>>>>>>> Sijie
>>>>>>> 
>>>>>>> On Friday, October 14, 2016, Xi Liu <xi.liu....@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Thank you, Sijie
>>>>>>>> 
>>>>>>>> We have some internal discussions to sort out some details. We
>> are
>>>>>> ready
>>>>>>> to
>>>>>>>> collaborate with the community for adding the transaction
>> support
>>>> in
>>>>>> DL.
>>>>>>>> We'd like to share more.
>>>>>>>> 
>>>>>>>> I created a proposal wiki here -
>>>>>>>> https://cwiki.apache.org/confluence/display/DL/DP-1+-+
>>>>>>>> DistributedLog+Transaction+Support
>>>>>>>> 
>>>>>>>> (I followed KIP format and named it as DP (DistributedLog
>>>> Proposal -
>>>>> DP
>>>>>>> is
>>>>>>>> also short for Dynamic Programming). I don't know if you guys
>> like
>>>>> this
>>>>>>>> name or not. Feel free to change it :D)
>>>>>>>> 
>>>>>>>> I basically put my initial email as the content there so far.
>>>> Once we
>>>>>>>> finished our final discussion, I will update with more details.
>> At
>>>>> the
>>>>>>> same
>>>>>>>> time, any comments are welcome.
>>>>>>>> 
>>>>>>>> - Xi
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Sat, Oct 8, 2016 at 6:58 AM, Sijie Guo <si...@apache.org
>>>>>>> <javascript:;>>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Xi,
>>>>>>>>> 
>>>>>>>>> I just granted you the edit permission.
>>>>>>>>> 
>>>>>>>>> - Sijie
>>>>>>>>> 
>>>>>>>>> On Fri, Oct 7, 2016 at 10:34 AM, Xi Liu <xi.liu....@gmail.com
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>> 
>>>>>>>>>> I still can not edit the wiki. Can any of the pmc members
>>>> grant
>>>>> me
>>>>>>> the
>>>>>>>>>> permissions?
>>>>>>>>>> 
>>>>>>>>>> - Xi
>>>>>>>>>> 
>>>>>>>>>> On Sat, Sep 17, 2016 at 10:35 PM, Xi Liu <
>>>> xi.liu....@gmail.com
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Sijie,
>>>>>>>>>>> 
>>>>>>>>>>> I attempted to create a wiki page under that space. I
>> found
>>>>> that
>>>>>> I
>>>>>>> am
>>>>>>>>> not
>>>>>>>>>>> authorized with edit permission.
>>>>>>>>>>> 
>>>>>>>>>>> Can any of the committers grant me the wiki edit
>>>> permission? My
>>>>>>>> account
>>>>>>>>>> is
>>>>>>>>>>> "xi.liu.ant".
>>>>>>>>>>> 
>>>>>>>>>>> - Xi
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On Tue, Sep 13, 2016 at 9:26 AM, Sijie Guo <
>>>> si...@apache.org
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> This sounds interesting ... I will take a closer look and
>>>> give
>>>>>> my
>>>>>>>>>> comments
>>>>>>>>>>>> later.
>>>>>>>>>>>> 
>>>>>>>>>>>> At the same time, do you mind creating a wiki page to put
>>>> your
>>>>>>> idea
>>>>>>>>>> there?
>>>>>>>>>>>> You can add your wiki page under
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/DL/Project+
>>>>>> Proposals
>>>>>>>>>>>> 
>>>>>>>>>>>> You might need to ask in the dev list to grant the wiki
>>>> edit
>>>>>>>>> permissions
>>>>>>>>>>>> to
>>>>>>>>>>>> you once you have a wiki account.
>>>>>>>>>>>> 
>>>>>>>>>>>> - Sijie
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Mon, Sep 12, 2016 at 2:20 AM, Xi Liu <
>>>> xi.liu....@gmail.com
>>>>>>>> <javascript:;>> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Hello,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I asked the transaction support in distributedlog user
>>>> group
>>>>>> two
>>>>>>>>>> months
>>>>>>>>>>>>> ago. I want to raise this up again, as we are looking
>> for
>>>>>> using
>>>>>>>>>>>>> distributedlog for building a transactional data
>>>> service. It
>>>>>> is
>>>>>>> a
>>>>>>>>>> major
>>>>>>>>>>>>> feature that is missing in distributedlog. We have some
>>>>> ideas
>>>>>> to
>>>>>>>> add
>>>>>>>>>>>> this
>>>>>>>>>>>>> to distributedlog and want to know if they make sense
>> or
>>>>> not.
>>>>>> If
>>>>>>>>> they
>>>>>>>>>>>> are
>>>>>>>>>>>>> good, we'd like to contribute and develop with the
>>>>> community.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Here are the thoughts:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>>> 
>>>>>>>>>>>>> From our understanding, DL can provide "at-least-once"
>>>>>> delivery
>>>>>>>>>> semantic
>>>>>>>>>>>>> (if not, please correct me) but not "exactly-once"
>>>> delivery
>>>>>>>>> semantic.
>>>>>>>>>>>> That
>>>>>>>>>>>>> means that a message can be delivered one or more times
>>>> if
>>>>> the
>>>>>>>>> reader
>>>>>>>>>>>>> doesn't handle duplicates.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The duplicates come from two places, one is at writer
>>>> side
>>>>>> (this
>>>>>>>>>> assumes
>>>>>>>>>>>>> using write proxy not the core library), while the
>> other
>>>> one
>>>>>> is
>>>>>>> at
>>>>>>>>>>>> reader
>>>>>>>>>>>>> side.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - writer side: if the client attempts to write a record
>>>> to
>>>>> the
>>>>>>>> write
>>>>>>>>>>>>> proxies and gets a network error (e.g timeouts) then
>>>>> retries,
>>>>>>> the
>>>>>>>>>>>> retrying
>>>>>>>>>>>>> will potentially result in duplicates.
>>>>>>>>>>>>> - reader side:if the reader reads a message from a
>> stream
>>>>> and
>>>>>>> then
>>>>>>>>>>>> crashes,
>>>>>>>>>>>>> when the reader restarts it would restart from last
>> known
>>>>>>> position
>>>>>>>>>>>> (DLSN).
>>>>>>>>>>>>> If the reader fails after processing a record and
>> before
>>>>>>> recording
>>>>>>>>> the
>>>>>>>>>>>>> position, the processed record will be delivered again.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The reader problem can be properly addressed by making
>>>> use
>>>>> of
>>>>>>> the
>>>>>>>>>>>> sequence
>>>>>>>>>>>>> numbers of records and doing proper checkpointing. For
>>>>>> example,
>>>>>>> in
>>>>>>>>>>>>> database, it can checkpoint the indexed data with the
>>>>> sequence
>>>>>>>>> number
>>>>>>>>>> of
>>>>>>>>>>>>> records; in flink, it can checkpoint the state with the
>>>>>> sequence
>>>>>>>>>>>> numbers.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The writer problem can be addressed by implementing an
>>>>>>> idempotent
>>>>>>>>>>>> writer.
>>>>>>>>>>>>> However, an alternative and more powerful approach is
>> to
>>>>>> support
>>>>>>>>>>>>> transactions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *What does transaction mean?*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> A transaction means a collection of records can be
>>>> written
>>>>>>>>>>>> transactionally
>>>>>>>>>>>>> within a stream or across multiple streams. They will
>> be
>>>>>>> consumed
>>>>>>>> by
>>>>>>>>>> the
>>>>>>>>>>>>> reader together when a transaction is committed, or
>> will
>>>>> never
>>>>>>> be
>>>>>>>>>>>> consumed
>>>>>>>>>>>>> by the reader when the transaction is aborted.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The transaction will expose following guarantees:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - The reader should not be exposed to records written
>>>> from
>>>>>>>>> uncommitted
>>>>>>>>>>>>> transactions (mandatory)
>>>>>>>>>>>>> - The reader should consume the records in the
>>>> transaction
>>>>>>> commit
>>>>>>>>>> order
>>>>>>>>>>>>> rather than the record written order (mandatory)
>>>>>>>>>>>>> - No duplicated records within a transaction
>> (mandatory)
>>>>>>>>>>>>> - Allow interleaving transactional writes and
>>>>>> non-transactional
>>>>>>>>> writes
>>>>>>>>>>>>> (optional)
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Stream Transaction & Namespace Transaction*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> There will be two types of transaction, one is Stream
>>>> level
>>>>>>>>>> transaction
>>>>>>>>>>>>> (local transaction), while the other one is Namespace
>>>> level
>>>>>>>>>> transaction
>>>>>>>>>>>>> (global transaction).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The stream level transaction is a transactional
>>>> operation on
>>>>>>>> writing
>>>>>>>>>>>>> records to one stream; the namespace level transaction
>>>> is a
>>>>>>>>>>>> transactional
>>>>>>>>>>>>> operation on writing records to multiple streams.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> *Implementation Thoughts*
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - A transaction is consist of begin control record, a
>>>> series
>>>>>> of
>>>>>>>> data
>>>>>>>>>>>>> records and commit/abort control record.
>>>>>>>>>>>>> - The begin/commit/abort control record is written to a
>>>>>> `commit`
>>>>>>>> log
>>>>>>>>>>>>> stream, while the data records will be written to
>> normal
>>>>> data
>>>>>>> log
>>>>>>>>>>>> streams.
>>>>>>>>>>>>> - The `commit` log stream will be the same log stream
>> for
>>>>>>>>> stream-level
>>>>>>>>>>>>> transaction,  while it will be a *system* stream (or
>>>>> multiple
>>>>>>>> system
>>>>>>>>>>>>> streams) for namespace-level transactions.
>>>>>>>>>>>>> - The transaction code looks like as below:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> <code>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Transaction txn = client.transaction();
>>>>>>>>>>>>> Future<DLSN> result1 = txn.write(stream-0, record);
>>>>>>>>>>>>> Future<DLSN> result2 = txn.write(stream-1, record);
>>>>>>>>>>>>> Future<DLSN> result3 = txn.write(stream-2, record);
>>>>>>>>>>>>> Future<Pair<DLSN, DLSN>> result = txn.commit();
>>>>>>>>>>>>> 
>>>>>>>>>>>>> </code>
>>>>>>>>>>>>> 
>>>>>>>>>>>>> if the txn is committed, all the write futures will be
>>>>>> satisfied
>>>>>>>>> with
>>>>>>>>>>>> their
>>>>>>>>>>>>> written DLSNs. if the txn is aborted, all the write
>>>> futures
>>>>>> will
>>>>>>>> be
>>>>>>>>>>>> failed
>>>>>>>>>>>>> together. there is no partial failure state.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - The actually data flow will be:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 1. writer get a transaction id from the owner of the
>>>>> `commit'
>>>>>>> log
>>>>>>>>>> stream
>>>>>>>>>>>>> 1. write the begin control record (synchronously) with
>>>> the
>>>>>>>>> transaction
>>>>>>>>>>>> id
>>>>>>>>>>>>> 2. for each write within the same txn, it will be
>>>> assigned a
>>>>>>> local
>>>>>>>>>>>> sequence
>>>>>>>>>>>>> number starting from 0. the combination of transaction
>> id
>>>>> and
>>>>>>>> local
>>>>>>>>>>>>> sequence number will be used later on by the readers to
>>>>>>>> de-duplicate
>>>>>>>>>>>>> records.
>>>>>>>>>>>>> 3. the commit/abort control record will be written
>> based
>>>> on
>>>>>> the
>>>>>>>>>> results
>>>>>>>>>>>>> from 2.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Application can supply a timeout for the transaction
>>>> when
>>>>>>>>> #begin() a
>>>>>>>>>>>>> transaction. The owner of the `commit` log stream can
>>>> abort
>>>>>>>>>> transactions
>>>>>>>>>>>>> that never be committed/aborted within their timeout.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Failures:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * all the log records can be simply retried as they
>> will
>>>> be
>>>>>>>>>>>> de-duplicated
>>>>>>>>>>>>> probably at the reader side.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Reader:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * Reader can be configured to read uncommitted records
>> or
>>>>>>>> committed
>>>>>>>>>>>> records
>>>>>>>>>>>>> only (by default read uncommitted records)
>>>>>>>>>>>>> * If reader is configured to read committed records
>> only,
>>>>> the
>>>>>>> read
>>>>>>>>>> ahead
>>>>>>>>>>>>> cache will be changed to maintain one additional
>> pending
>>>>>>> committed
>>>>>>>>>>>> records.
>>>>>>>>>>>>> the pending committed records map is bounded and
>> records
>>>>> will
>>>>>> be
>>>>>>>>>> dropped
>>>>>>>>>>>>> when read ahead is moving.
>>>>>>>>>>>>> * when the reader hits a commit record, it will rewind
>> to
>>>>> the
>>>>>>>> begin
>>>>>>>>>>>> record
>>>>>>>>>>>>> and start reading from there. leveraging the proper
>> read
>>>>> ahead
>>>>>>>> cache
>>>>>>>>>> and
>>>>>>>>>>>>> pending commit records cache, it would be good for both
>>>>> short
>>>>>>>>>>>> transactions
>>>>>>>>>>>>> and long transactions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - DLSN, SequenceId:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> * We will add a fourth field to DLSN. It is `local
>>>> sequence
>>>>>>>> number`
>>>>>>>>>>>> within
>>>>>>>>>>>>> a transaction session. So the new DLSN of records in a
>>>>>>> transaction
>>>>>>>>>> will
>>>>>>>>>>>> be
>>>>>>>>>>>>> the DLSN of commit control record plus its local
>> sequence
>>>>>>> number.
>>>>>>>>>>>>> * The sequence id will be still the position of the
>>>> commit
>>>>>>> record
>>>>>>>>> plus
>>>>>>>>>>>> its
>>>>>>>>>>>>> local sequence number. The position will be advanced
>> with
>>>>>> total
>>>>>>>>> number
>>>>>>>>>>>> of
>>>>>>>>>>>>> written records on writing the commit control record.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Transaction Group & Namespace Transaction
>>>>>>>>>>>>> 
>>>>>>>>>>>>> using one single log stream for namespace transaction
>> can
>>>>>> cause
>>>>>>>> the
>>>>>>>>>>>>> bottleneck problem since all the begin/commit/end
>> control
>>>>>>> records
>>>>>>>>> will
>>>>>>>>>>>> have
>>>>>>>>>>>>> to go through one log stream.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> the idea of 'transaction group' is to allow
>> partitioning
>>>> the
>>>>>>>> writers
>>>>>>>>>>>> into
>>>>>>>>>>>>> different transaction groups.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> clients can specify the `group-name` when starting the
>>>>>>>> transaction.
>>>>>>>>> if
>>>>>>>>>>>>> there is no `group-name` specified, it will use the
>>>> default
>>>>>>>> `commit`
>>>>>>>>>>>> log in
>>>>>>>>>>>>> the namespace for creating transactions.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> -------------------------------------------------
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I'd like to collect feedbacks on this idea. Appreciate
>>>> any
>>>>>>>> comments
>>>>>>>>>> and
>>>>>>>>>>>> if
>>>>>>>>>>>>> anyone is also interested in this idea, we'd like to
>>>>>> collaborate
>>>>>>>>> with
>>>>>>>>>>>> the
>>>>>>>>>>>>> community.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> - Xi
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>>> 
>> 

Reply via email to