Thank you all for your feedbacks. @Kenn, @Robert I got what you mean now. I think it's good to make window/timestamp/paneinfo values configurable while also avoid redundant encoding and decoding.
Besides, I think the idea from @Luke is good, that we can replace WindowedValue<T> with T in some cases to reduce the overhead of understanding. But maybe we can solve it in a seperate issue? I will update the PR according to option 2 (Add a windowed value coder whose window/timestamp/paneinfo values are specified as constants). Thanks again. Best, Jincheng Kenneth Knowles <[email protected]>于2019年11月9日 周六02:59写道: > > > On Fri, Nov 8, 2019 at 9:23 AM Luke Cwik <[email protected]> wrote: > >> >> >> On Thu, Nov 7, 2019 at 7:36 PM Kenneth Knowles <[email protected]> wrote: >> >>> >>> >>> On Thu, Nov 7, 2019 at 9:19 AM Luke Cwik <[email protected]> wrote: >>> >>>> I did suggest one other alternative on Jincheng's PR[1] which was to >>>> allow windowless values to be sent across the gRPC port. The SDK would then >>>> be responsible for ensuring that the execution didn't access any properties >>>> that required knowledge of the timestamp, pane or window. This is different >>>> then adding the ValueOnlyWindowedValueCoder as a model coder because it >>>> allows SDKs to pass around raw values to functions without any windowing >>>> overhead which could be useful for things like the side input window >>>> mapping or window merging functions we have. >>>> >>> >>> When you say "pass around" what does it mean? If it is over the wire, >>> there is already no overhead to ValueOnlyWindowedValueCoder. So do you mean >>> the overhead of having the layer of boxing of WindowedValue? I would assume >>> all non-value components of the WindowedValue from >>> ValueOnlyWindowedValueCoder are pointers to a single shared immutable >>> instance carried by the coder instance. >>> >> >> I was referring to the layer of boxing of WindowedValue. My concern >> wasn't the performance overhead of passing around a wrapper object but the >> cognitive overhead of understanding why everything needs to be wrapped in a >> windowed value. Since you have been working on SQL for some time, this >> would be analogous to executing a UDF and making all the machinery around >> it take WindowedValue<T> instead of T. >> >> >>> I think raw values can already be passed to functions, no? The main >>> thing is that elements in a PCollection always have a window, timestamp, >>> and paneinfo. Not all values are elements. Is there a specific case you >>> have in mind? I would not expect WindowMappingFn or window merging fn to be >>> passing "elements" but just values of the appropriate type for the function. >>> >> >> This is about the machinery around WindowMappingFn/WindowMergingFn. For >> example the implementation around WindowMappingFn takes a >> WindowedValue<Window> and unwraps it forwarding it to the WindowMappingFn >> and then takes the result and wraps it in a WindowedValue<Window> and >> returns that to the runner. >> > > I'm not familiar with this, but it sounds like it should not be necessary > and is an implementation detail. Is there a model change necessary to avoid > the unboxing/boxing? I would be surprised. > > Kenn > > >> >> >>> >>> >>>> On Thu, Nov 7, 2019 at 8:48 AM Robert Bradshaw <[email protected]> >>>> wrote: >>>> >>>>> I think there is some misunderstanding about what is meant by option >>>>> 2. What Kenn (I think) and I are proposing is not a WindowedValueCoder >>>>> whose window/timestamp/paneinfo coders are parameterized to be >>>>> constant coders, but a WindowedValueCoder whose >>>>> window/timestamp/paneinfo values are specified as constants in the >>>>> coder. >>>>> >>>>> Let's call this NewValueOnlyWindowedValueCoder, and is parameterized >>>>> by a window, timestamp, and pane info instance >>>>> >>>>> The existing ValueOnlyWindowedValueCoder is literally >>>>> NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP, >>>>> PaneInfo.NO_FIRING). Note in particular that using the existing >>>>> ValueOnlyWindowedValueCoder would give the wrong timestamp and pane >>>>> info if it is use for the result of a GBK, which I think is the loss >>>>> of consistency referred to here. >>>>> >>>> >>> Yes, this is exactly what I am proposing and sounds like what "approach >>> 2" is. I think this approach "just works". It is simpler and more efficient >>> than "WindowedValueCoder.of(ValueCoder, WindowCoder, TimestampCoder, >>> PaneInfoCoder)" which would require some overhead even if WindowCoder, >>> TimestampCoder and PaneInfoCoder were constant coders consuming zero bytes. >>> >>> Kenn >>> >>> >>> >>>> On Thu, Nov 7, 2019 at 1:03 AM jincheng sun <[email protected]> >>>>> wrote: >>>>> > >>>>> > Thanks for your feedback and the valuable comments, Kenn & Robert! >>>>> > >>>>> > I think your comments are more comprehensive and enlighten me a lot. >>>>> The two proposals which I mentioned above are to reuse the existing coder >>>>> (FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your >>>>> comments, I think we can further abstract 'FullWindowedValueCoder' and >>>>> 'ValueOnlyWindowedValueCoder', that is, we can rename >>>>> 'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of >>>>> window/timestamp/pane configurable. Then we can remove >>>>> 'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders >>>>> for window/timestamp/pane. >>>>> > >>>>> > I have replied your comments on the doc, and quick feedback as >>>>> following: >>>>> > >>>>> > Regarding to "Approach 2: probably no SDK harness work / compatible >>>>> with existing Beam model so no risk of introducing inconsistency",if we >>>>> "just puts default window/timestamp/pane info on elements" and don't >>>>> change >>>>> the original coder, then the performance is not optimized. If we want to >>>>> get the best performance, then the default coder of Window/timestamp/pane >>>>> should be constant coder. In this case the SDK harnesses need to be aware >>>>> of the constant coder and there will be some development work in the SDK >>>>> harness. Besides, the SDK harness also needs to make the coders for >>>>> window/timestamp/pane configurable and this will introduce some related >>>>> changes, such as updating WindowedValueCoder._get_component_coders, etc. >>>>> > >>>>> > Regarding to "Approach 1: option a: if the SDK harness has to >>>>> understand 'values without windows' then very large changes and high risk >>>>> of introducing inconsistency (I eliminated many of these >>>>> inconsistencies)", >>>>> we only need to add ValueOnlyWindowedValueCoder to the StandardCoders and >>>>> all the SDK harness should be aware of this coder. There is no much >>>>> changes >>>>> actually. >>>>> > >>>>> > Please feel free to correct me if there is anyting incorrect. :) >>>>> > >>>>> > Besides, I'm not quite clear about the consistency issues you meant >>>>> here. Could you please give me some hints about this? >>>>> > >>>>> > Best, >>>>> > Jincheng >>>>> > >>>>> > Robert Bradshaw <[email protected]> 于2019年11月7日周四 上午3:38写道: >>>>> >> >>>>> >> Yes, the portability framework is designed to support this, and >>>>> >> possibly even more efficient transfers of data than >>>>> element-by-element >>>>> >> as per the wire coder specified in the IO port operators. I left >>>>> some >>>>> >> comments on the doc as well, and would also prefer approach 2. >>>>> >> >>>>> >> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <[email protected]> >>>>> wrote: >>>>> >> > >>>>> >> > I think the portability framework is designed for this. The >>>>> runner controls the coder on the grpc ports and the runner controls the >>>>> process bundle descriptor. >>>>> >> > >>>>> >> > I commented on the doc. I think what is missing is analysis of >>>>> scope of SDK harness changes and risk to model consistency >>>>> >> > >>>>> >> > Approach 2: probably no SDK harness work / compatible with >>>>> existing Beam model so no risk of introducing inconsistency >>>>> >> > >>>>> >> > Approach 1: what are all the details? >>>>> >> > option a: if the SDK harness has to understand "values >>>>> without windows" then very large changes and high risk of introducing >>>>> inconsistency (I eliminated many of these inconsistencies) >>>>> >> > option b: if the coder just puts default >>>>> window/timestamp/pane info on elements, then it is the same as approach 2, >>>>> no work / no risk >>>>> >> > >>>>> >> > Kenn >>>>> >> > >>>>> >> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun < >>>>> [email protected]> wrote: >>>>> >> >> >>>>> >> >> Hi all, >>>>> >> >> >>>>> >> >> I am trying to make some improvements of portability framework >>>>> to make it usable in other projects. However, we find that the coder >>>>> between runner and harness can only be FullWindowedValueCoder. This means >>>>> each time when sending a WindowedValue, we have to encode/decode >>>>> timestamp, >>>>> windows and pan infos. In some circumstances(such as using the portability >>>>> framework in Flink), only values are needed between runner and harness. >>>>> So, >>>>> it would be nice if we can configure the coder and avoid redundant >>>>> encoding >>>>> and decoding between runner and harness to improve the performance. >>>>> >> >> >>>>> >> >> There are two approaches to solve this issue: >>>>> >> >> >>>>> >> >> Approach 1: Support ValueOnlyWindowedValueCoder between >>>>> runner and harness. >>>>> >> >> Approach 2: Add a "constant" window coder that embeds all >>>>> the windowing information as part of the coder that should be used to wrap >>>>> the value during decoding. >>>>> >> >> >>>>> >> >> More details can be found here [1]. >>>>> >> >> >>>>> >> >> As of the shortcomings of “Approach 2” which still need to >>>>> encode/decode timestamp and pane infos, we tend to choose “Approach 1” >>>>> which brings better performance and is more thorough. >>>>> >> >> >>>>> >> >> Welcome any feedback :) >>>>> >> >> >>>>> >> >> Best, >>>>> >> >> Jincheng >>>>> >> >> >>>>> >> >> [1] >>>>> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing >>>>> >> >> >>>>> >>>>
