Re: [Proposal] Slowly Changing Dimensions and Distributed Map Side Inputs (in Dataflow)

Jan Lukavský Tue, 17 Dec 2019 02:29:00 -0800

Hi Mikhail,

On 12/17/19 10:43 AM, Mikhail Gryzykhin wrote:

inline

On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský <[email protected]> wrote:


    Hi,

    I actually thought that the proposal refers to Dataflow only. If
    this is supposed to be general, can we remove the
    Dataflow/Windmill specific parts and replace them with generic ones?

I'll look into rephrasing the doc to keep Dataflow/Windmill as an example.

Cool, thanks!

    I'd have two more questions:

     a) the proposal is named "Slowly changing", why is the rate of
    change essential to the proposal? Once running on event time, that
    should not matter, or what am I missing?
Within this proposal, it is suggested to make a full snapshot of the data on every re-read. This is generally expensive, and setting the time event to a short interval might cause issues. Otherwise it is not essential.

Understood. This relates to table-stream duality, where the requirements might relax once you don't have to convert the table to a stream by re-reading it, but can instead retrieve updates as you go (an example would be reading directly from Kafka or any other "commit log" abstraction).
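
To make the difference between the two approaches concrete, here is a minimal sketch in plain Python (not Beam code). The names refresh_from_snapshot, apply_changelog and the Change record shape are hypothetical and only serve the illustration: the first function rebuilds the table from a full re-read, the second applies only the updates read from a commit log.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical change record: (key, new_value); a value of None means "delete".
Change = Tuple[str, Optional[str]]


def refresh_from_snapshot(
    read_all: Callable[[], Iterable[Tuple[str, str]]]
) -> Dict[str, str]:
    """Re-read approach: rebuild the whole table, so cost grows with table size."""
    return dict(read_all())


def apply_changelog(table: Dict[str, str], changes: Iterable[Change]) -> Dict[str, str]:
    """Changelog approach: apply only what changed, so cost grows with the update count."""
    for key, value in changes:
        if value is None:
            table.pop(key, None)  # tolerate deletes for keys we never saw
        else:
            table[key] = value
    return table


if __name__ == "__main__":
    table = {"a": "1", "b": "2"}
    apply_changelog(table, [("a", "3"), ("b", None), ("c", "4")])
    print(table)
```

With a full re-read the refresh interval bounds how stale the table can get, which is why a short interval becomes expensive; with a changelog the table stays current at a cost proportional to the number of changes.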


     b) The description says: 'User wants to solve a stream enrichment
    problem. In brief request sounds like: ”I want to enrich each
    event in this stream by corresponding data from given table.”'.
    That is understandable, but would it be better to enable the user
    to express this intent directly (via Join operation)? The actual
    implementation might be runner (and input!) specific. The analogy
    is that when doing group-by-key operation, runner can choose hash
    grouping or sort-merge grouping, but that is not (directly)
    expressed in user code. I'm not saying that we should not have
    low-level transforms, just asking if it would be better to leave
    this decision to the runner (at least in some cases). It might be
    the case that we want to make core SDK as low level as possible
    (and as reasonable), I just want to make sure that that is really
    the intent.
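
A rough sketch of what "express the intent, let the runner choose" could look like, in plain Python rather than Beam code. Everything here is an assumption made for illustration: the enrich entry point, the two strategies, and the max_broadcast_rows threshold are hypothetical, and the sort-merge variant assumes the table has at most one row per key.

```python
from typing import Iterable, List, Optional, Tuple

Event = Tuple[str, str]                    # (key, payload)
Row = Tuple[str, str]                      # (key, dimension value)
Enriched = Tuple[str, str, Optional[str]]  # (key, payload, matched value or None)


def broadcast_hash_join(events: List[Event], rows: List[Row]) -> List[Enriched]:
    """Load the whole table into a hash map and probe it per event."""
    lookup = dict(rows)
    return [(k, p, lookup.get(k)) for k, p in events]


def sort_merge_join(events: List[Event], rows: List[Row]) -> List[Enriched]:
    """Sort both sides by key and merge them in a single pass."""
    ev = sorted(events, key=lambda e: e[0])
    rw = sorted(rows, key=lambda r: r[0])
    out: List[Enriched] = []
    i = 0
    for k, p in ev:
        while i < len(rw) and rw[i][0] < k:
            i += 1  # skip table rows with smaller keys
        match = rw[i][1] if i < len(rw) and rw[i][0] == k else None
        out.append((k, p, match))
    return out


def enrich(events: Iterable[Event], rows: Iterable[Row],
           max_broadcast_rows: int = 1000) -> List[Enriched]:
    """User states the intent; the strategy is picked here, not in user code."""
    rows = list(rows)
    small = len(rows) <= max_broadcast_rows
    strategy = broadcast_hash_join if small else sort_merge_join
    return strategy(list(events), rows)
```

Both strategies return the same enriched records (possibly in a different order), which is the point of the analogy with hash versus sort-merge grouping: the caller of enrich never names the strategy.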

The idea is to add a basic operation with as small a change as possible to the current API. The ultimate goal is to have a Join/GBK operator that will choose the proper strategy. However, I don't think that we have the proper tools and view of how to choose the best strategy at hand as of yet.

OK, cool. That is where I would find it very useful to have some sort of "goals" that we are targeting. I agree that there are some pieces missing in the puzzle as of now. But it would be good to know what these pieces are and what needs to be done to fulfill our goals. This is probably not related to the discussion of this proposal, though, but more to the concept of BIP or similar.


Thanks for the explanation.

    Thanks for the proposal!

    Jan

    On 12/17/19 12:01 AM, Kenneth Knowles wrote:

    I want to highlight that this design works for definitely more
    runners than just Dataflow. I see two pieces of it that I want to
    bring onto the thread:

    1. A new kind of "unbounded source" which is a periodic refresh
    of a bounded source, and use that as a side input. Each main
    input element has a window that maps to a specific refresh of the
    side input.
    2. Distributed map side inputs: supporting very large lookup
    tables, but with consistency challenges. Even the part about
    "windmill API" probably applies to other runners.
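
As a simplified illustration of point 1, the sketch below (plain Python, not Beam code) keeps one snapshot per refresh and resolves each lookup against the latest refresh at or before the element's event time. The VersionedSideInput class and its methods are hypothetical names, and integer timestamps stand in for event time; a runner would express the same mapping through windows rather than a sorted list.

```python
import bisect
from typing import Dict, List, Optional


class VersionedSideInput:
    """Keeps one table snapshot per refresh, ordered by refresh time."""

    def __init__(self) -> None:
        self._times: List[int] = []
        self._snapshots: List[Dict[str, str]] = []

    def add_refresh(self, refresh_time: int, snapshot: Dict[str, str]) -> None:
        # Assumption: refreshes are added in increasing time order.
        self._times.append(refresh_time)
        self._snapshots.append(dict(snapshot))

    def lookup(self, event_time: int, key: str) -> Optional[str]:
        # Find the latest refresh whose time is <= event_time.
        i = bisect.bisect_right(self._times, event_time) - 1
        if i < 0:
            return None  # element is older than the first refresh
        return self._snapshots[i].get(key)


if __name__ == "__main__":
    side = VersionedSideInput()
    side.add_refresh(0, {"user-1": "bronze"})
    side.add_refresh(60, {"user-1": "gold"})
    print(side.lookup(30, "user-1"))
    print(side.lookup(75, "user-1"))
```

Because the lookup depends only on the element's event time, a late element is still enriched with the snapshot that was valid when the event happened, not with whatever refresh is newest at processing time.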

    So I hope the title and "Objective" section do not cause people
    to stop reading.

    Kenn

    On Mon, Dec 16, 2019 at 11:36 AM Mikhail Gryzykhin
    <[email protected]> wrote:

        +some people explicitly

        Can you please check on the doc and comment if it looks fine?

        Thank you,
        --Mikhail

        On Tue, Dec 10, 2019 at 1:43 PM Mikhail Gryzykhin
        <[email protected]> wrote:

            "Good news, everyone-"
            ―Farnsworth

            Hi everyone,

            Recently, I was looking into relaxing limitations on side
            inputs in Dataflow runner. As part of it, I came up with
            design proposal for standardizing slowly changing
            dimensions use case in Beam and relevant changes to add
            support for distributed map side inputs.

            Please review and comment on design doc.
            
            <https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg> [1]

            Thank you,
            Mikhail.

            -----

            [1] https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg
