Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

jincheng sun Mon, 11 Nov 2019 05:30:11 -0800

Thank you all for your feedbacks.

@Kenn, @Robert I got what you mean now. I think it's good to make
window/timestamp/paneinfo values configurable while also avoid redundant
encoding and decoding.


Besides, I think the idea from @Luke is good, that we can replace
WindowedValue<T> with T in some cases to reduce the overhead of
understanding. But maybe we can solve it in a seperate issue?

I will update the PR according to option 2 (Add a windowed value coder
whose window/timestamp/paneinfo values are specified as constants). Thanks
again.

Best,
Jincheng

Kenneth Knowles <[email protected]>于2019年11月9日 周六02:59写道：

>
>
> On Fri, Nov 8, 2019 at 9:23 AM Luke Cwik <[email protected]> wrote:
>
>>
>>
>> On Thu, Nov 7, 2019 at 7:36 PM Kenneth Knowles <[email protected]> wrote:
>>
>>>
>>>
>>> On Thu, Nov 7, 2019 at 9:19 AM Luke Cwik <[email protected]> wrote:
>>>
>>>> I did suggest one other alternative on Jincheng's PR[1] which was to
>>>> allow windowless values to be sent across the gRPC port. The SDK would then
>>>> be responsible for ensuring that the execution didn't access any properties
>>>> that required knowledge of the timestamp, pane or window. This is different
>>>> then adding the ValueOnlyWindowedValueCoder as a model coder because it
>>>> allows SDKs to pass around raw values to functions without any windowing
>>>> overhead which could be useful for things like the side input window
>>>> mapping or window merging functions we have.
>>>>
>>>
>>> When you say "pass around" what does it mean? If it is over the wire,
>>> there is already no overhead to ValueOnlyWindowedValueCoder. So do you mean
>>> the overhead of having the layer of boxing of WindowedValue? I would assume
>>> all non-value components of the WindowedValue from
>>> ValueOnlyWindowedValueCoder are pointers to a single shared immutable
>>> instance carried by the coder instance.
>>>
>>
>> I was referring to the layer of boxing of WindowedValue. My concern
>> wasn't the performance overhead of passing around a wrapper object but the
>> cognitive overhead of understanding why everything needs to be wrapped in a
>> windowed value. Since you have been working on SQL for some time, this
>> would be analogous to executing a UDF and making all the machinery around
>> it take WindowedValue<T> instead of T.
>>
>>
>>> I think raw values can already be passed to functions, no? The main
>>> thing is that elements in a PCollection always have a window, timestamp,
>>> and paneinfo. Not all values are elements. Is there a specific case you
>>> have in mind? I would not expect WindowMappingFn or window merging fn to be
>>> passing "elements" but just values of the appropriate type for the function.
>>>
>>
>> This is about the machinery around WindowMappingFn/WindowMergingFn. For
>> example the implementation around WindowMappingFn takes a
>> WindowedValue<Window> and unwraps it forwarding it to the WindowMappingFn
>> and then takes the result and wraps it in a WindowedValue<Window> and
>> returns that to the runner.
>>
>
> I'm not familiar with this, but it sounds like it should not be necessary
> and is an implementation detail. Is there a model change necessary to avoid
> the unboxing/boxing? I would be surprised.
>
> Kenn
>
>
>>
>>
>>>
>>>
>>>> On Thu, Nov 7, 2019 at 8:48 AM Robert Bradshaw <[email protected]>
>>>> wrote:
>>>>
>>>>> I think there is some misunderstanding about what is meant by option
>>>>> 2. What Kenn (I think) and I are proposing is not a WindowedValueCoder
>>>>> whose window/timestamp/paneinfo coders are parameterized to be
>>>>> constant coders, but a WindowedValueCoder whose
>>>>> window/timestamp/paneinfo values are specified as constants in the
>>>>> coder.
>>>>>
>>>>> Let's call this NewValueOnlyWindowedValueCoder, and is parameterized
>>>>> by a window, timestamp, and pane info instance
>>>>>
>>>>> The existing ValueOnlyWindowedValueCoder is literally
>>>>> NewValueOnlyWindowedValueCoder(GlobalWindow, MIN_TIMESTAMP,
>>>>> PaneInfo.NO_FIRING). Note in particular that using the existing
>>>>> ValueOnlyWindowedValueCoder would give the wrong timestamp and pane
>>>>> info if it is use for the result of a GBK, which I think is the loss
>>>>> of consistency referred to here.
>>>>>
>>>>
>>> Yes, this is exactly what I am proposing and sounds like what "approach
>>> 2" is. I think this approach "just works". It is simpler and more efficient
>>> than "WindowedValueCoder.of(ValueCoder, WindowCoder, TimestampCoder,
>>> PaneInfoCoder)" which would require some overhead even if WindowCoder,
>>> TimestampCoder and PaneInfoCoder were constant coders consuming zero bytes.
>>>
>>> Kenn
>>>
>>>
>>>
>>>> On Thu, Nov 7, 2019 at 1:03 AM jincheng sun <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> > Thanks for your feedback and the valuable comments, Kenn & Robert!
>>>>> >
>>>>> > I think your comments are more comprehensive and enlighten me a lot.
>>>>> The two proposals which I mentioned above are to reuse the existing coder
>>>>> (FullWindowedValueCoder and ValueOnlyWindowedValueCoder). Now, with your
>>>>> comments, I think we can further abstract 'FullWindowedValueCoder' and
>>>>> 'ValueOnlyWindowedValueCoder', that is, we can rename
>>>>> 'FullWindowedValueCoder' to 'WindowedValueCoder', and make the coders of
>>>>> window/timestamp/pane configurable. Then we can remove
>>>>> 'ValueOnlyWindowedValueCoder' and need to add a mount of constant coders
>>>>> for window/timestamp/pane.
>>>>> >
>>>>> > I have replied your comments on the doc, and quick feedback as
>>>>> following:
>>>>> >
>>>>> > Regarding to "Approach 2: probably no SDK harness work / compatible
>>>>> with existing Beam model so no risk of introducing inconsistency",if we
>>>>> "just puts default window/timestamp/pane info on elements" and don't 
>>>>> change
>>>>> the original coder, then the performance is not optimized. If we want to
>>>>> get the best performance, then the default coder of Window/timestamp/pane
>>>>> should be constant coder. In this case the SDK harnesses need to be aware
>>>>> of the constant coder and there will be some development work in the SDK
>>>>> harness. Besides, the SDK harness also needs to make the coders for
>>>>> window/timestamp/pane configurable and this will introduce some related
>>>>> changes, such as updating WindowedValueCoder._get_component_coders, etc.
>>>>> >
>>>>> > Regarding to "Approach 1: option a: if the SDK harness has to
>>>>> understand 'values without windows' then very large changes and high risk
>>>>> of introducing inconsistency (I eliminated many of these 
>>>>> inconsistencies)",
>>>>> we only need to add ValueOnlyWindowedValueCoder to the StandardCoders and
>>>>> all the SDK harness should be aware of this coder. There is no much 
>>>>> changes
>>>>> actually.
>>>>> >
>>>>> > Please feel free to correct me if there is anyting incorrect. :)
>>>>> >
>>>>> > Besides, I'm not quite clear about the consistency issues you meant
>>>>> here. Could you please give me some hints about this?
>>>>> >
>>>>> > Best,
>>>>> > Jincheng
>>>>> >
>>>>> > Robert Bradshaw <[email protected]> 于2019年11月7日周四 上午3:38写道：
>>>>> >>
>>>>> >> Yes, the portability framework is designed to support this, and
>>>>> >> possibly even more efficient transfers of data than
>>>>> element-by-element
>>>>> >> as per the wire coder specified in the IO port operators. I left
>>>>> some
>>>>> >> comments on the doc as well, and would also prefer approach 2.
>>>>> >>
>>>>> >> On Wed, Nov 6, 2019 at 11:03 AM Kenneth Knowles <[email protected]>
>>>>> wrote:
>>>>> >> >
>>>>> >> > I think the portability framework is designed for this. The
>>>>> runner controls the coder on the grpc ports and the runner controls the
>>>>> process bundle descriptor.
>>>>> >> >
>>>>> >> > I commented on the doc. I think what is missing is analysis of
>>>>> scope of SDK harness changes and risk to model consistency
>>>>> >> >
>>>>> >> >     Approach 2: probably no SDK harness work / compatible with
>>>>> existing Beam model so no risk of introducing inconsistency
>>>>> >> >
>>>>> >> >     Approach 1: what are all the details?
>>>>> >> >         option a: if the SDK harness has to understand "values
>>>>> without windows" then very large changes and high risk of introducing
>>>>> inconsistency (I eliminated many of these inconsistencies)
>>>>> >> >         option b: if the coder just puts default
>>>>> window/timestamp/pane info on elements, then it is the same as approach 2,
>>>>> no work / no risk
>>>>> >> >
>>>>> >> > Kenn
>>>>> >> >
>>>>> >> > On Wed, Nov 6, 2019 at 1:09 AM jincheng sun <
>>>>> [email protected]> wrote:
>>>>> >> >>
>>>>> >> >> Hi all,
>>>>> >> >>
>>>>> >> >> I am trying to make some improvements of portability framework
>>>>> to make it usable in other projects. However, we find that the coder
>>>>> between runner and harness can only be FullWindowedValueCoder. This means
>>>>> each time when sending a WindowedValue, we have to encode/decode 
>>>>> timestamp,
>>>>> windows and pan infos. In some circumstances(such as using the portability
>>>>> framework in Flink), only values are needed between runner and harness. 
>>>>> So,
>>>>> it would be nice if we can configure the coder and avoid redundant 
>>>>> encoding
>>>>> and decoding between runner and harness to improve the performance.
>>>>> >> >>
>>>>> >> >> There are two approaches to solve this issue:
>>>>> >> >>
>>>>> >> >>     Approach 1:  Support ValueOnlyWindowedValueCoder between
>>>>> runner and harness.
>>>>> >> >>     Approach 2:  Add a "constant" window coder that embeds all
>>>>> the windowing information as part of the coder that should be used to wrap
>>>>> the value during decoding.
>>>>> >> >>
>>>>> >> >> More details can be found here [1].
>>>>> >> >>
>>>>> >> >> As of the shortcomings of “Approach 2” which still need to
>>>>> encode/decode timestamp and pane infos, we tend to choose “Approach 1”
>>>>> which brings better performance and is more thorough.
>>>>> >> >>
>>>>> >> >> Welcome any feedback :)
>>>>> >> >>
>>>>> >> >> Best,
>>>>> >> >> Jincheng
>>>>> >> >>
>>>>> >> >> [1]
>>>>> https://docs.google.com/document/d/1TTKZC6ppVozG5zV5RiRKXse6qnJl-EsHGb_LkUfoLxY/edit?usp=sharing
>>>>> >> >>
>>>>>
>>>>

Re: [DISCUSS] Avoid redundant encoding and decoding between runner and harness

Reply via email to