UPD: I have updated the doc with API suggestions; please check the relevant section of the doc [1].
--Mikhail

[1] https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg/edit#heading=h.5e78hch3k732

On Thu, Jan 16, 2020 at 2:52 AM Reza Rokni <r...@google.com> wrote:

> +1 to this proposal; this is a very common pattern requirement from users.
> The following current workaround has seen a lot of traction:
>
> https://beam.apache.org/documentation/patterns/side-inputs/#slowly-updating-global-window-side-inputs
>
> Making this process simpler for users and available out of the box would
> be a great win!
>
> I would also mention that ideally we will also cover the large distributed
> side inputs, but a lot of the core cases for this come down to side inputs
> that do fit in memory. Perhaps it is worth prioritizing the work so that
> the smaller side input tables take precedence, unless the work will cover
> both cases in the same way, of course.
>
> Cheers
>
> Reza
>
> On Thu, 19 Dec 2019 at 07:14, Kenneth Knowles <k...@apache.org> wrote:
>
>> I do think that the implementation concerns around larger side inputs are
>> relevant to most runners. Ideally there would be no model change necessary.
>> Triggers are harder and bring in consistency concerns, which are even more
>> likely to be relevant to all runners.
>>
>> Kenn
>>
>> On Wed, Dec 18, 2019 at 11:23 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> Most of the doc is about how to support distributed side inputs in
>>> Dataflow and doesn't really cover how the Beam model's triggers
>>> (accumulating, discarding, retraction) affect what the "contents" of a
>>> PCollection are over time, and how this proposal for a limited set of
>>> side input shapes can work to support larger side inputs in Dataflow.
>>>
>>> On Tue, Dec 17, 2019 at 2:28 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Hi Mikhail,
>>>> On 12/17/19 10:43 AM, Mikhail Gryzykhin wrote:
>>>>
>>>> inline
>>>>
>>>> On Tue, Dec 17, 2019 at 12:59 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I actually thought that the proposal refers to Dataflow only. If this
>>>>> is supposed to be general, can we remove the Dataflow/Windmill-specific
>>>>> parts and replace them with generic ones?
>>>>>
>>>> I'll look into rephrasing the doc to keep Dataflow/Windmill as an example.
>>>>
>>>> Cool, thanks!
>>>>
>>>>> I'd have two more questions:
>>>>>
>>>>> a) The proposal is named "Slowly changing". Why is the rate of change
>>>>> essential to the proposal? Once running on event time, that should not
>>>>> matter, or what am I missing?
>>>>>
>>>> Within this proposal, it is suggested to make a full snapshot of the data
>>>> on every re-read. This is generally expensive, and setting the time
>>>> interval to a short value might cause issues. Otherwise it is not essential.
>>>>
>>>> Understood. This relates to table-stream duality, where the
>>>> requirements might relax once you no longer have to convert the table to
>>>> a stream by re-reading it and can instead retrieve updates as you go (an
>>>> example would be reading directly from Kafka or any other "commit log"
>>>> abstraction).
>>>>
>>>>> b) The description says: 'User wants to solve a stream enrichment
>>>>> problem. In brief request sounds like: "I want to enrich each event in
>>>>> this stream by corresponding data from given table."'. That is
>>>>> understandable, but would it be better to enable the user to express
>>>>> this intent directly (via a Join operation)? The actual implementation
>>>>> might be runner (and input!) specific. The analogy is that when doing a
>>>>> group-by-key operation, the runner can choose hash grouping or
>>>>> sort-merge grouping, but that is not (directly) expressed in user code.
>>>>> I'm not saying that we should not have
>>>>> low-level transforms, just asking if it would be better to leave this
>>>>> decision to the runner (at least in some cases). It might be the case that
>>>>> we want to make the core SDK as low level as possible (and as reasonable);
>>>>> I just want to make sure that that is really the intent.
>>>>>
>>>> The idea is to add a basic operation with as small a change as possible
>>>> to the current API.
>>>> The ultimate goal is to have a Join/GBK operator that will choose the
>>>> proper strategy. However, I don't think that we have the proper tools or
>>>> a clear view of how to choose the best strategy as of yet.
>>>>
>>>> OK, cool. That is where I would find it very useful to have some
>>>> sort of "goals" that we are targeting. I agree that there are some pieces
>>>> missing in the puzzle as of now. But it would be good to know what these
>>>> pieces are and what needs to be done to fulfill our goals. This is
>>>> probably not related to the discussion of this proposal, but more
>>>> related to the concept of BIP or similar.
>>>>
>>>> Thanks for the explanation.
>>>>
>>>>> Thanks for the proposal!
>>>>>
>>>>> Jan
>>>>> On 12/17/19 12:01 AM, Kenneth Knowles wrote:
>>>>>
>>>>> I want to highlight that this design definitely works for more runners
>>>>> than just Dataflow. I see two pieces of it that I want to bring onto the
>>>>> thread:
>>>>>
>>>>> 1. A new kind of "unbounded source" which is a periodic refresh of a
>>>>> bounded source, used as a side input. Each main input element has a
>>>>> window that maps to a specific refresh of the side input.
>>>>> 2. Distributed map side inputs: supporting very large lookup tables,
>>>>> but with consistency challenges. Even the part about "windmill API"
>>>>> probably applies to other runners.
>>>>>
>>>>> So I hope the title and "Objective" section do not cause people to
>>>>> stop reading.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Dec 16, 2019 at 11:36 AM Mikhail Gryzykhin <mig...@google.com>
>>>>> wrote:
>>>>>
>>>>>> +some people explicitly
>>>>>>
>>>>>> Can you please check the doc and comment on whether it looks fine?
>>>>>>
>>>>>> Thank you,
>>>>>> --Mikhail
>>>>>>
>>>>>> On Tue, Dec 10, 2019 at 1:43 PM Mikhail Gryzykhin <mig...@google.com>
>>>>>> wrote:
>>>>>>
>>>>>>> "Good news, everyone-"
>>>>>>> ―Farnsworth
>>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> Recently, I was looking into relaxing limitations on side inputs in
>>>>>>> the Dataflow runner. As part of it, I came up with a design proposal
>>>>>>> for standardizing the slowly changing dimensions use case in Beam and
>>>>>>> the relevant changes to add support for distributed map side inputs.
>>>>>>>
>>>>>>> Please review and comment on the design doc [1].
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Mikhail.
>>>>>>>
>>>>>>> -----
>>>>>>>
>>>>>>> [1]
>>>>>>> https://docs.google.com/document/d/1LDY_CtsOJ8Y_zNv1QtkP6AGFrtzkj1q5EW_gSChOIvg
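
For readers of this thread, here is a minimal sketch of the first piece Kenn describes: a periodic refresh of a bounded source used as a side input, where each main input element's window maps to one specific refresh. It is a plain-Python simulation, not Beam code. The names `read_table`, `periodic_snapshots`, and `enrich` are hypothetical and do not come from the design doc or any Beam API. It assumes integer event timestamps and fixed windows aligned to the refresh period.

```python
from typing import Callable, Dict, List, Optional, Tuple

Snapshot = Dict[str, str]


def window_start(timestamp: int, period: int) -> int:
    """Start of the fixed window of length `period` containing `timestamp`."""
    return timestamp - timestamp % period


def periodic_snapshots(
    read_table: Callable[[int], Snapshot], start: int, end: int, period: int
) -> Dict[int, Snapshot]:
    """Re-read the whole bounded table once per window (one full snapshot each)."""
    first = window_start(start, period)
    return {t: dict(read_table(t)) for t in range(first, end + 1, period)}


def enrich(
    events: List[Tuple[int, str]], snapshots: Dict[int, Snapshot], period: int
) -> List[Tuple[int, str, Optional[str]]]:
    """Join each (timestamp, key) event with the snapshot for its own window."""
    enriched = []
    for timestamp, key in events:
        side_input = snapshots[window_start(timestamp, period)]
        enriched.append((timestamp, key, side_input.get(key)))
    return enriched


def read_table(now: int) -> Snapshot:
    """Hypothetical slowly changing table: its contents change at time 60."""
    if now >= 60:
        return {"user1": "gold", "user2": "silver"}
    return {"user1": "bronze"}


snapshots = periodic_snapshots(read_table, start=0, end=119, period=60)
events = [(5, "user1"), (59, "user2"), (61, "user1"), (110, "user2")]
enriched = enrich(events, snapshots, period=60)
```

Because every refresh is a full snapshot, the cost grows as the period shrinks, which is the concern Mikhail raises above about short intervals.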