On Thu, Nov 7, 2019 at 7:36 PM Kenneth Knowles <k...@apache.org> wrote:
> > > On Thu, Nov 7, 2019 at 9:19 AM Luke Cwik <lc...@google.com> wrote: > >> I did suggest one other alternative on Jincheng's PR[1] which was to >> allow windowless values to be sent across the gRPC port. The SDK would then >> be responsible for ensuring that the execution didn't access any properties >> that required knowledge of the timestamp, pane or window. This is different >> then adding the ValueOnlyWindowedValueCoder as a model coder because it >> allows SDKs to pass around raw values to functions without any windowing >> overhead which could be useful for things like the side input window >> mapping or window merging functions we have. >> > > When you say "pass around" what does it mean? If it is over the wire, > there is already no overhead to ValueOnlyWindowedValueCoder. So do you mean > the overhead of having the layer of boxing of WindowedValue? I would assume > all non-value components of the WindowedValue from > ValueOnlyWindowedValueCoder are pointers to a single shared immutable > instance carried by the coder instance. > I was referring to the layer of boxing of WindowedValue. My concern wasn't the performance overhead of passing around a wrapper object but the cognitive overhead of understanding why everything needs to be wrapped in a windowed value. Since you have been working on SQL for some time, this would be analogous to executing a UDF and making all the machinery around it take WindowedValue<T> instead of T. > I think raw values can already be passed to functions, no? The main thing > is that elements in a PCollection always have a window, timestamp, and > paneinfo. Not all values are elements. Is there a specific case you have in > mind? I would not expect WindowMappingFn or window merging fn to be passing > "elements" but just values of the appropriate type for the function. > This is about the machinery around WindowMappingFn/WindowMergingFn. For example the implementation around WindowMappingFn takes a WindowedValue<Window> and unwraps it forwarding it to the WindowMappingFn and then takes the result and wraps it in a WindowedValue<Window> and returns that to the runner. > > >> On Thu, Nov 7, 2019 at 8:48 AM Robert Bradshaw <rober...@google.com> >> wrote: >> >>> I think there is some misunderstanding about what is meant by option >>> 2. What Kenn (I think) and I are proposing is not a WindowedValueCoder >>> whose window/timestamp/paneinfo coders are parameterized to be >>> constant coders, but a WindowedValueCoder whose >>> window/timestamp/paneinfo values are specified as constants in the >>> coder. >>> >>> Let's call this NewValueOnlyWindowedValueCoder, and is parameterized >>> by a window, timestamp, and pane info instance >>> >>> The existing ValueOnlyWindowedValueCoder is literally >>> NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP, >>> PaneInfo.NO_FIRING). Note in particular that using the existing >>> ValueOnlyWindowedValueCoder would give the wrong timestamp and pane >>> info if it is use for the result of a GBK, which I think is the loss >>> of consistency referred to here. >>> >> > Yes, this is exactly what I am proposing and sounds like what "approach 2" > is. I think this approach "just works". It is simpler and more efficient > than "WindowedValueCoder.of(ValueCoder, WindowCoder, TimestampCoder, > PaneInfoCoder)" which would require some overhead even if WindowCoder, > TimestampCoder and PaneInfoCoder were constant coders consuming zero bytes. > > Kenn > > > >> On Thu, Nov 7, 2019 at 1:03 AM jincheng sun <sunjincheng...@gmail.com> >>> wrote: >>> > >>> > Thanks for your feedback and the valuable comments, Kenn & Robert! >>> > >>> > I think your comments are more comprehensive and enlighten me a lot. >>> The two proposals which I mentioned above are to reuse the existing coder >>> (FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your >>> comments, I think we can further abstract 'FullWindowedValueCoder' and >>> 'ValueOnlyWindowedValueCoder', that is, we can rename >>> 'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of >>> window/timestamp/pane configurable. Then we can remove >>> 'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders >>> for window/timestamp/pane. >>> > >>> > I have replied your comments on the doc, and quick feedback as >>> following: >>> > >>> > Regarding to "Approach 2: probably no SDK harness work / compatible >>> with existing Beam model so no risk of introducing inconsistency",if we >>> "just puts default window/timestamp/pane info on elements" and don't change >>> the original coder, then the performance is not optimized. If we want to >>> get the best performance, then the default coder of Window/timestamp/pane >>> should be constant coder. In this case the SDK harnesses need to be aware >>> of the constant coder and there will be some development work in the SDK >>> harness. Besides, the SDK harness also needs to make the coders for >>> window/timestamp/pane configurable and this will introduce some related >>> changes, such as updating WindowedValueCoder._get_component_coders, etc. >>> > >>> > Regarding to "Approach 1: option a: if the SDK harness has to >>> understand 'values without windows' then very large changes and high risk >>> of introducing inconsistency (I eliminated many of these inconsistencies)", >>> we only need to add ValueOnlyWindowedValueCoder to the StandardCoders and >>> all the SDK harness should be aware of this coder. There is no much changes >>> actually. >>> > >>> > Please feel free to correct me if there is anyting incorrect. :) >>> > >>> > Besides, I'm not quite clear about the consistency issues you meant >>> here. Could you please give me some hints about this? >>> > >>> > Best, >>> > Jincheng >>> > >>> > Robert Bradshaw <rober...@google.com> 于2019年11月7日周四 上午3:38写道: >>> >> >>> >> Yes, the portability framework is designed to support this, and >>> >> possibly even more efficient transfers of data than element-by-element >>> >> as per the wire coder specified in the IO port operators. I left some >>> >> comments on the doc as well, and would also prefer approach 2. >>> >> >>> >> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <k...@apache.org> >>> wrote: >>> >> > >>> >> > I think the portability framework is designed for this. The runner >>> controls the coder on the grpc ports and the runner controls the process >>> bundle descriptor. >>> >> > >>> >> > I commented on the doc. I think what is missing is analysis of >>> scope of SDK harness changes and risk to model consistency >>> >> > >>> >> > Approach 2: probably no SDK harness work / compatible with >>> existing Beam model so no risk of introducing inconsistency >>> >> > >>> >> > Approach 1: what are all the details? >>> >> > option a: if the SDK harness has to understand "values >>> without windows" then very large changes and high risk of introducing >>> inconsistency (I eliminated many of these inconsistencies) >>> >> > option b: if the coder just puts default >>> window/timestamp/pane info on elements, then it is the same as approach 2, >>> no work / no risk >>> >> > >>> >> > Kenn >>> >> > >>> >> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun < >>> sunjincheng...@gmail.com> wrote: >>> >> >> >>> >> >> Hi all, >>> >> >> >>> >> >> I am trying to make some improvements of portability framework to >>> make it usable in other projects. However, we find that the coder between >>> runner and harness can only be FullWindowedValueCoder. This means each time >>> when sending a WindowedValue, we have to encode/decode timestamp, windows >>> and pan infos. In some circumstances(such as using the portability >>> framework in Flink), only values are needed between runner and harness. So, >>> it would be nice if we can configure the coder and avoid redundant encoding >>> and decoding between runner and harness to improve the performance. >>> >> >> >>> >> >> There are two approaches to solve this issue: >>> >> >> >>> >> >> Approach 1: Support ValueOnlyWindowedValueCoder between >>> runner and harness. >>> >> >> Approach 2: Add a "constant" window coder that embeds all the >>> windowing information as part of the coder that should be used to wrap the >>> value during decoding. >>> >> >> >>> >> >> More details can be found here [1]. >>> >> >> >>> >> >> As of the shortcomings of “Approach 2” which still need to >>> encode/decode timestamp and pane infos, we tend to choose “Approach 1” >>> which brings better performance and is more thorough. >>> >> >> >>> >> >> Welcome any feedback :) >>> >> >> >>> >> >> Best, >>> >> >> Jincheng >>> >> >> >>> >> >> [1] >>> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing >>> >> >> >>> >>