Re: [Proposal] Slowly Changing Dimensions and Distributed Map Side Inputs (in Dataflow)

Mikhail Gryzykhin Mon, 23 Mar 2020 10:12:48 -0700

UPD:
I have updated the doc with API suggestions; please check the relevant
section of the doc [1].


--Mikhail

[1]
https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg/edit#heading=h.5e78hch3k732

On Thu, Jan 16, 2020 at 2:52 AM Reza Rokni <[email protected]> wrote:

> +1 to this proposal; this is a very common request from users. The
> following current workaround has seen a lot of traction:
>
>
> https://beam.apache.org/documentation/patterns/side-inputs/#slowly-updating-global-window-side-inputs
>
> Making this process simpler for users and available out of the box would
> be a great win!
>
> I would also mention that ideally we will also cover large distributed
> side inputs, but a lot of the core cases for this come down to side inputs
> that do fit in memory. It is perhaps worth prioritizing the work so that
> the smaller side input tables take precedence, unless the work will cover
> both cases in the same way, of course.
>
> Cheers
>
> Reza
>
> On Thu, 19 Dec 2019 at 07:14, Kenneth Knowles <[email protected]> wrote:
>
>> I do think that the implementation concerns around larger side inputs are
>> relevant to most runners. Ideally there would be no model change necessary.
>> Triggers are harder and bring in consistency concerns, which are even more
>> likely to be relevant to all runners.
>>
>> Kenn
>>
>> On Wed, Dec 18, 2019 at 11:23 AM Luke Cwik <[email protected]> wrote:
>>
>>> Most of the doc is about how to support distributed side inputs in
>>> Dataflow and doesn't really cover how the Beam model's triggers
>>> (accumulating, discarding, retraction) affect what the "contents" of a
>>> PCollection are over time and how this proposal for a limited set of side
>>> input shapes can work to support larger side inputs in Dataflow.
>>>
>>> On Tue, Dec 17, 2019 at 2:28 AM Jan Lukavský <[email protected]> wrote:
>>>
>>>> Hi Mikhail,
>>>> On 12/17/19 10:43 AM, Mikhail Gryzykhin wrote:
>>>>
>>>> inline
>>>>
>>>> On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I actually thought that the proposal refers to Dataflow only. If this
>>>>> is supposed to be general, can we remove the Dataflow/Windmill specific
>>>>> parts and replace them with generic ones?
>>>>>
>>>>  I'll look into rephrasing the doc to keep Dataflow/Windmill as an example.
>>>>
>>>> Cool, thanks!
>>>>
>>>> I'd have two more questions:
>>>>>
>>>>>  a) the proposal is named "Slowly changing", why is the rate of change
>>>>> essential to the proposal? Once running on event time, that should not
>>>>> matter, or what am I missing?
>>>>>
>>>> Within this proposal, it is suggested to take a full snapshot of the data
>>>> on every re-read. This is generally expensive, and setting the re-read
>>>> interval too short might cause issues. Otherwise it is not essential.
>>>>
>>>> Understood. This relates to table-stream duality, where the
>>>> requirements might relax once you no longer have to convert the table to
>>>> a stream by re-reading it and can instead retrieve updates as you go (an
>>>> example would be reading directly from Kafka or any other "commit log"
>>>> abstraction).
>>>>
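To make the table-stream duality point above concrete, here is a minimal sketch in plain Python. It is not Beam API: `apply_changelog`, the table contents, and the "None means delete" convention are all assumptions chosen for illustration. It shows that a full re-read of the table and an incremental stream of updates can arrive at the same table state.

```python
def apply_changelog(table, changes):
    """Apply (key, value) upserts in order; a value of None deletes the key."""
    updated = dict(table)  # copy so the previous snapshot stays intact
    for key, value in changes:
        if value is None:
            updated.pop(key, None)
        else:
            updated[key] = value
    return updated


# Hypothetical lookup table and a later full re-read of the same table.
snapshot_t0 = {"sku-1": "books", "sku-2": "toys"}
snapshot_t1 = {"sku-1": "books", "sku-2": "games", "sku-3": "music"}

# The same change expressed as a commit-log style stream of updates.
changes = [("sku-2", "games"), ("sku-3", "music")]

# Both routes arrive at the same table state.
assert apply_changelog(snapshot_t0, changes) == snapshot_t1
```

The cost difference is the point: the snapshot route re-reads every row on each refresh, while the changelog route only touches the rows that changed.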
>>>>  b) The description says: 'User wants to solve a stream enrichment
>>>>> problem. In brief request sounds like: “I want to enrich each event in
>>>>> this stream by corresponding data from given table.”'. That is
>>>>> understandable,
>>>>> but would it be better to enable the user to express this intent directly
>>>>> (via Join operation)? The actual implementation might be runner (and
>>>>> input!) specific. The analogy is that when doing group-by-key operation,
>>>>> runner can choose hash grouping or sort-merge grouping, but that is not
>>>>> (directly) expressed in user code. I'm not saying that we should not have
>>>>> low-level transforms, just asking if it would be better to leave this
>>>>> decision to the runner (at least in some cases). It might be the case that
>>>>> we want to make core SDK as low level as possible (and as reasonable), I
>>>>> just want to make sure that that is really the intent.
>>>>>
>>>> The idea is to add a basic operation with as small a change as possible
>>>> to the current API.
>>>> The ultimate goal is to have a Join/GBK operator that will choose the
>>>> proper strategy. However, I don't think that we have the proper tools, or
>>>> a clear view of how to choose the best strategy, at hand as of yet.
>>>>
>>>> OK, cool. That is where I would find it very useful to have some
>>>> sort of "goals" that we are targeting. I agree that there are some pieces
>>>> missing in the puzzle as of now. But it would be good to know what these
>>>> pieces are and what needs to be done to fulfill our goals. But this is
>>>> probably not related to discussion of this proposal, but more related to
>>>> the concept of BIP or similar.
>>>>
>>>> Thanks for the explanation.
>>>>
>>>> Thanks for the proposal!
>>>>>
>>>>> Jan
>>>>> On 12/17/19 12:01 AM, Kenneth Knowles wrote:
>>>>>
>>>>> I want to highlight that this design definitely works for more runners
>>>>> than just Dataflow. I see two pieces of it that I want to bring onto the
>>>>> thread:
>>>>>
>>>>> 1. A new kind of "unbounded source" which is a periodic refresh of a
>>>>> bounded source, used as a side input. Each main input element has a
>>>>> window that maps to a specific refresh of the side input.
>>>>> 2. Distributed map side inputs: supporting very large lookup tables,
>>>>> but with consistency challenges. Even the part about "windmill API"
>>>>> probably applies to other runners.
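As a rough illustration of the first piece, the sketch below models in plain Python how a main input element could be mapped to one specific refresh of a side input by its event time. It is not Beam API and not the design from the doc: `REFRESHES`, `snapshot_for`, `enrich`, the 60-second cadence, and the table contents are all hypothetical.

```python
from bisect import bisect_right

# Hypothetical refresh log: (refresh_time_seconds, snapshot of the lookup table).
REFRESHES = [
    (0, {"user-1": "bronze"}),
    (60, {"user-1": "silver"}),
    (120, {"user-1": "gold", "user-2": "bronze"}),
]


def snapshot_for(event_time, refreshes=REFRESHES):
    """Return the latest snapshot taken at or before event_time."""
    times = [t for t, _ in refreshes]
    idx = bisect_right(times, event_time) - 1
    if idx < 0:
        raise ValueError("no refresh available at or before event time")
    return refreshes[idx][1]


def enrich(events, refreshes=REFRESHES):
    """Attach the side-input value visible at each event's timestamp."""
    return [
        (key, ts, snapshot_for(ts, refreshes).get(key))
        for key, ts in events
    ]


events = [("user-1", 30), ("user-1", 60), ("user-2", 90), ("user-2", 125)]
enriched = enrich(events)
```

Here each event sees exactly one refresh, so an element at time 90 is enriched from the refresh taken at 60 even though a newer one exists by time 120. A key that is absent from the selected snapshot yields `None`.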
>>>>>
>>>>> So I hope the title and "Objective" section do not cause people to
>>>>> stop reading.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Dec 16, 2019 at 11:36 AM Mikhail Gryzykhin <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> +some people explicitly
>>>>>>
>>>>>> Can you please check on the doc and comment if it looks fine?
>>>>>>
>>>>>> Thank you,
>>>>>> --Mikhail
>>>>>>
>>>>>> On Tue, Dec 10, 2019 at 1:43 PM Mikhail Gryzykhin <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> "Good news, everyone-"
>>>>>>> ―Farnsworth
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Recently, I was looking into relaxing limitations on side inputs in
>>>>>>> the Dataflow runner. As part of it, I came up with a design proposal
>>>>>>> for standardizing the slowly changing dimensions use case in Beam and
>>>>>>> the relevant changes to add support for distributed map side inputs.
>>>>>>>
>>>>>>> Please review and comment on the design doc [1].
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Mikhail.
>>>>>>>
>>>>>>> -----
>>>>>>>
>>>>>>> [1]
>>>>>>> https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg
>>>>>>>
>>>>>>>
>
