On Fri, Jun 19, 2020 at 11:53 AM Robert Burke <rob...@frantil.com> wrote:

> Marvelous! Thank you Luke!
>  I do like having less work to do, while finding out the Go SDK is doing
> things the right way already (if not yet completely).
>
> Given that the portable coders are always in the "nested" context, it's
> reasonable that we don't mention them in the proto. I'm less concerned
> about removing the test cases for them right now, as it sounds like those
> situations still come up in the legacy executions of the Python and Java
> SDKs.
>
> The action items I'm taking on as part of BEAM-3203:
>
> 1. Document in the proto file the details of the encodings that are
> missing: bytes, string, kv, bool, global window, interval window, length
> prefix
>

Some of this may have already been done in beam_runner_api.proto as I know
that a few folks have improved this in the past couple of months.


> 2. Document the redundancy of the LP(bytes)/bytes and LP(string)/string
> situation. If I understand correctly, when an SDK sees either of these
> coders, they're always only expecting a single length prefix for the
> following byte stream.
>

LP(bytes) and LP(string) should have one length prefix followed by whatever
the correct "nested" encoding for string/bytes is. So in practice this
turns into two length prefixes. LP('abc') should be '4' '3' 'abc'. The '4'
represents how many bytes the element takes and the '3' is an internal
implementation detail of the string coder so that it knows how many bytes
it has to decode. The encoding for LP(X) and X are never the same.
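To make the double prefix concrete, here's a small sketch in Python. The helper names are mine, not Beam's, and the varint helper only matters for lengths >= 128; this is just to illustrate the byte layout, not the actual SDK implementations:

```python
def varint(n):
    # Beam length prefixes use a base-128 varint; values < 128 fit in one byte.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_string_nested(s):
    # "Nested" string encoding: a varint byte length, then the UTF-8 bytes.
    data = s.encode("utf-8")
    return varint(len(data)) + data

def encode_lp(inner):
    # The length_prefix coder adds one more varint prefix around an
    # already-nested encoding.
    return varint(len(inner)) + inner

encoded = encode_lp(encode_string_nested("abc"))
print(list(encoded))  # [4, 3, 97, 98, 99]: outer '4', inner '3', then 'abc'
```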


> 3. Mention in the proto that test cases for the coders are in
> standard_coders.yaml
>
> 4. In standard_coders.yaml, clearly document what "Nested" means and that
> Portable SDKs only need to handle "Nested" examples.
> 5. Add test cases for LP coders, as they don't exist yet.
>

Sounds good for 3-5.


>
> That last one will be done separately, but still falls into documentation,
> as test cases are documentation when commented effectively.
>
> Another separate nit related to standard_coders.yaml that I noticed:
> the encoded example bytes are not in Latin-1. They're the raw bytes
> *interpreted* as Latin-1, and then written out. Initially I was decoding
> those strings as Latin-1 and the tests were all failing. Instead one needs
> to re-encode the string as Latin-1 to get the expected bytes for
> comparison. This is a separate, larger task, since the yaml examples would
> need to be re-encoded and each of the SDK tests that use the file would
> need to be adjusted accordingly.
>

+1, the interpretation of the YAML representation is not well documented,
but it might be a well-known property of all YAML files.
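For the record (and for other SDK authors hitting the same snag), the round trip looks like this in Python. The example string here is made up, not taken from the real file:

```python
# Hypothetical "encoded" value as it would be read out of the YAML file:
# a str in which each original byte was *interpreted* as a Latin-1 character.
yaml_example = "\x04\x03abc"

# Correct: re-encode the string as Latin-1 to recover the raw bytes.
# Latin-1 maps code points 0-255 one-to-one onto bytes, so nothing is lost.
expected = yaml_example.encode("latin1")
print(expected == b"\x04\x03abc")  # True

# The trap: any code point >= 0x80 shows why other encodings give wrong bytes.
tricky = "\xff"                    # one character, code point 255
print(tricky.encode("latin1"))     # b'\xff'      -- one byte, as intended
print(tricky.encode("utf-8"))      # b'\xc3\xbf'  -- two bytes, wrong for comparison
```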


>
> In the meantime, I'll document the correct procedure for SDK authors to
> take to use the file correctly.
>
> Thanks again!
>
>
>
> On Fri, Jun 19, 2020, 11:22 AM Luke Cwik <lc...@google.com> wrote:
>
>> Thanks for offering to help with the documentation (BEAM-3203). The high
>> level documentation about this is here[1] and the beam_runner_api.proto
>> also contains a bunch of information related to coders.
>>
>> Length prefixing is not the same as the nesting coder property in
>> Java/Python and we have been trying to get rid of that property but due to
>> backwards compatibility it can only be marked deprecated. Internally within
>> the Python/Java SDKs, we use coders for producing output to other things
>> like files which is why the unnested (aka outer) coder context exists. We
>> should not have used the coder concept for formatting/parsing IO and also
>> as an intermediate representation for state/PCollections/.... All the
>> portable Beam APIs are defined to use the nested encoding, and any that
>> aren't are a bug. I would suggest only testing the nested variants in
>> standard_coders.yaml and deleting the unnested variants of tests (I'm not
>> certain as to why they exist in the first place).
>>
>> As for the wrapping issue you brought up: wherever a coder is referenced
>> by a PTransform in the pipeline graph, the associated environment is
>> expected to understand all the component encodings. So if a
>> LengthPrefix<String> coder is in the graph referenced by transform X with
>> environment Y, then environment Y is expected to know the encodings for
>> LengthPrefix and String and should be able to accept either
>> LengthPrefix<String> or String as input or output. This currently comes up
>> in the data API (elements, timers), in the state API (user state, side
>> inputs), when providing the encoded element during splitting, and during
>> graph expansion for certain well-known transforms like combiners and
>> splittable DoFns.
>>
>> 1:
>> https://docs.google.com/document/d/1IGduUqmhWDi_69l9nG8kw73HZ5WI5wOps9Tshl5wpQA/edit#heading=h.g19rkupi8zga
>>
>> On Fri, Jun 19, 2020 at 10:49 AM Robert Burke <r...@google.com> wrote:
>>
>>> Hello dev@! I have questions.
>>>
>>> Context: I'm adding Beam Schemas to the Go SDK, and on the way I'm
>>> validating the Go SDK coders against standard_coders.yaml, per BEAM-7009[0].
>>>
>>> *When is it reasonable for a runner to send an SDK an unnested byte or
>>> string coder (AKA, no length prefixing)?*
>>> *What contexts are considered "unnested"? *"Nested" isn't a documented
>>> property anywhere (except probably in the Python or Java code, which isn't
>>> useful from a portability perspective), so it's not clear how SDK
>>> developers are supposed to know what it means and does.
>>>
>>> Based on experience, rather than on documentation in the proto spec [1]
>>> or in standard_coders.yaml [2], there's no portable specification of which
>>> contexts are supposed to be nested and when, which implies it's a holdover
>>> from pre-portability.
>>>
>>> But most importantly, *when is it actually used? *
>>>
>>> I understand there's a theoretical value in avoiding needing the length
>>> ahead of time when encoding very large single elements, but is that
>>> property ever taken advantage of anywhere?
>>>
>>> Currently, the Go SDK doesn't support unnested coders at all. All
>>> :bytes:v1 and string_utf8:v1 coders are assumed to be length prefixed. So
>>> values generated by the SDK will always be marked as LP for those
>>> variable-length coders, and that's what the pipeline generates for them.
>>> What's not clear to me is when an SDK should assume the bytes
>>> aren't length prefixed; that's not documented anywhere, nor is the
>>> intended purpose of the distinction.
>>>
>>> I'm happy to go ahead and add such documentation to the protos or
>>> standard_coders.yaml file for posterity, but I can't until I understand the
>>> situation better. I'd like it to be documented so new SDK authors don't run
>>> into the same confusion I have.
>>>
>>> Thanks, and Cheers.
>>> Robert Burke
>>>
>>> [0] https://issues.apache.org/jira/browse/BEAM-7009
>>> [1]
>>> https://github.com/apache/beam/blob/a5b2046b10bebc59c5bde41d4cb6498058fdada2/model/pipeline/src/main/proto/beam_runner_api.proto#L672
>>> [2]
>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml#L18
>>>
>>
