Correction: I gave bad information earlier, so I went back and
double-checked what I said. The outer context still exists, but only in
the case of length prefixing. So testing the 'unnested'/'outer' encoding
is still necessary for coders where it differs, but it is only exercised
in the LP(X) scenario. So what we should really be doing is updating the
tests to contain LP(Bytes) and LP(String) instead of the unnested
variants.

LP(X) -> uses the outer/unnested encoding of X. So there is only a single
length prefix, not two as I had mentioned. The difference is that the LP
coder has a well-known and understood format for the length, while X can
use any arbitrary format it wants to delimit the element boundary (e.g.
varint prefix, big-endian prefix, a delimiter character like \0, ...). It
just so happens that LP, String, and Bytes use the same delimiting scheme,
so Encoding(LP(Bytes)) == Encoding(Bytes). This is actually annoying
because the encoding of X can't be converted into the encoding of LP(X)
without understanding how X does its element delimiting.
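A minimal sketch of this in Python (illustrative only, not actual Beam SDK
code): because the bytes coder already delimits its elements with the same
well-known varint prefix that LP uses, the two encodings coincide.

```python
def varint(n):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_bytes_nested(value):
    # The bytes coder's own element-delimiting scheme: a varint length
    # prefix in front of the payload.
    return varint(len(value)) + value

def encode_lp(outer_encoding):
    # LP(X) adds a single well-known varint length prefix to the outer
    # (undelimited) encoding of X.
    return varint(len(outer_encoding)) + outer_encoding

# The outer encoding of bytes is just the raw payload, so LP(Bytes)
# and the nested Bytes encoding produce identical output.
assert encode_lp(b"abc") == encode_bytes_nested(b"abc") == b"\x03abc"
```

Note that converting Encoding(X) into Encoding(LP(X)) would require first
stripping X's own delimiting, which only X knows how to do.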

On Fri, Jun 19, 2020 at 12:02 PM Luke Cwik <lc...@google.com> wrote:

>
>
> On Fri, Jun 19, 2020 at 11:53 AM Robert Burke <rob...@frantil.com> wrote:
>
>> Marvelous! Thank you Luke!
>> I do like having less work to do, while finding out that the Go SDK is
>> already doing things the right way (if not yet completely).
>>
>> Given that the portable coders are always in the "nested" context, it's
>> reasonable that we don't mention them in the proto. I'm less concerned
>> about removing the test cases for them right now, as it sounds like those
>> situations still come up in the legacy executions of the Python and Java
>> SDKs.
>>
>> The action items which I'm taking on as part of BEAM-3203 :
>>
>> 1. Document in the proto file the details of the encodings that are
>> missing: bytes, string, kv, bool, global window, interval window, length
>> prefix
>>
>
> Some of this may have already been done in beam_runner_api.proto as I know
> that a few folks have improved this in the past couple of months.
>
>
>> 2. Document the redundancy of the LP(bytes)/bytes and LP(string)/string
>> situation. If I understand correctly, when an SDK sees either of these
>> coders, it's always expecting only a single length prefix for the
>> following byte stream.
>>
>
> LP(bytes) and LP(string) should have one length prefix followed by
> whatever the correct "nested" encoding for string/bytes is. So in practice
> this turns into two length prefixes. LP('abc') should be '4' '3' 'abc'. The
> '4' represents how many bytes the element takes and the '3' is an internal
> implementation detail of the string coder so that it knows how many bytes
> it has to decode. The encoding for LP(X) and X are never the same.
>

>
>> 3. Mention in the proto that test cases for the coders are in
>> standard_coders.yaml
>>
>> 4. In standard_coders.yaml, clearly document what "nested" means and that
>> Portable SDKs only need to handle "nested" examples.
>> 5. Add test cases for the LP coders, as they don't exist yet.
>>
>
> Sounds good for 3-5.
>
>
>>
>> That last one will be done separately, but still falls into
>> documentation, as test cases are documentation when commented effectively.
>>
>> Another separate nit related to standard_coders.yaml that I noticed:
>> the encoded example bytes are not in Latin-1. They're the raw bytes
>> *interpreted* as Latin-1 and then written out. Initially I was decoding
>> those characters as Latin-1 and the tests were all failing. Instead one
>> needs to re-encode the YAML string as Latin-1 to get the expected bytes
>> for comparison. This is a separate, larger task, since the yaml examples
>> would need to be re-encoded and each of the SDK tests that use the file
>> would need to be adjusted accordingly.
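To make the Latin-1 nit concrete, here is a small Python illustration (the
example value is hypothetical, not taken from the actual file):

```python
# standard_coders.yaml stores the raw coder-output bytes *interpreted*
# as Latin-1 (ISO-8859-1) characters. Latin-1 maps code points 0-255
# one-to-one onto byte values, so recovering the raw bytes means
# encoding the loaded string back to Latin-1 -- decoding goes the
# wrong direction.
yaml_value = "\u00c8\u0001"              # hypothetical string as loaded from YAML
expected = yaml_value.encode("latin-1")  # the actual bytes the coder must emit
assert expected == b"\xc8\x01"
```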
>>
>
> +1, the interpretation of the YAML representation is not well documented
> but it might be a well known property of all YAML files.
>
>
>>
>> In the meantime, I'll document the correct procedure for SDK authors to
>> take to use the file correctly.
>>
>> Thanks again!
>>
>>
>>
>> On Fri, Jun 19, 2020, 11:22 AM Luke Cwik <lc...@google.com> wrote:
>>
>>> Thanks for offering to help with the documentation (BEAM-3203). The high
>>> level documentation about this is here[1] and the beam_runner_api.proto
>>> also contains a bunch of information related to coders.
>>>
>>> Length prefixing is not the same as the nesting coder property in
>>> Java/Python and we have been trying to get rid of that property but due to
>>> backwards compatibility it can only be marked deprecated. Internally within
>>> the Python/Java SDKs, we use coders for producing output to other things
>>> like files which is why the unnested (aka outer) coder context exists. We
>>> should not have used the coder concept for formatting/parsing IO and also
>>> as an intermediate representation for state/PCollections/.... All the
>>> portable Beam APIs are defined to use the nested encoding, and any that
>>> aren't are bugs. I would suggest only testing the nested variants in
>>> standard_coders.yaml and deleting the unnested variants of tests (I'm not
>>> certain as to why they exist in the first place).
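The nested/outer "context" distinction described above can be sketched as
follows (assumed semantics for illustration only, not the actual
Java/Python Coder API):

```python
def encode_varint(n):
    """Encode a non-negative int as a base-128 varint."""
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def encode_bytes(value, nested):
    if nested:
        # Nested context: the element sits inside a larger stream, so
        # it must be self-delimiting -- prefix it with a varint length.
        return encode_varint(len(value)) + value
    # Outer context: end-of-stream delimits the element (e.g. one
    # element written as a whole file), so no prefix is needed.
    return value

assert encode_bytes(b"abc", nested=True) == b"\x03abc"
assert encode_bytes(b"abc", nested=False) == b"abc"
```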
>>>
>>> As for the wrapping issue you brought up: anywhere a coder is referenced
>>> by a PTransform in the pipeline graph, the associated environment is
>>> expected to understand all the component encodings. So if a
>>> LengthPrefix<String> coder is in the graph referenced by transform X with
>>> environment Y then environment Y is expected to know the encoding for
>>> LengthPrefix and String and should be able to accept either
>>> LengthPrefix<String> or String as input or output. This currently appears
>>> on the data API (elements, timers) and the state API (user state, side
>>> inputs), as well as when providing the encoded element during splitting,
>>> and finally during graph expansion for certain well-known transforms like
>>> combiners and splittable DoFns.
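As a sketch of what that implies for an environment (hypothetical helpers,
not the actual Beam SDK API): decoding a LengthPrefix<String> element
requires knowing both component encodings, the outer varint length from
LengthPrefix and the UTF-8 payload from String.

```python
def read_varint(buf):
    """Decode a base-128 varint; return (value, remaining bytes)."""
    shift = value = 0
    for i, b in enumerate(buf):
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, buf[i + 1:]
        shift += 7
    raise ValueError("truncated varint")

def decode_lp_string(buf):
    # Outer prefix: how many bytes the element occupies.
    n, rest = read_varint(buf)
    # Inner payload: the string's own (outer-context) UTF-8 bytes.
    return rest[:n].decode("utf-8")

assert decode_lp_string(b"\x03abc") == "abc"
```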
>>>
>>> 1:
>>> https://docs.google.com/document/d/1IGduUqmhWDi_69l9nG8kw73HZ5WI5wOps9Tshl5wpQA/edit#heading=h.g19rkupi8zga
>>>
>>> On Fri, Jun 19, 2020 at 10:49 AM Robert Burke <r...@google.com> wrote:
>>>
>>>> Hello dev@! I have questions.
>>>>
>>>> Context: I'm adding Beam Schemas to the Go SDK, and on the way I'm
>>>> validating the Go SDK coders against standard_coders.yaml, per 
>>>> BEAM-7009[0].
>>>>
>>>> *When is it reasonable for a runner to send an SDK an unnested byte or
>>>> string coder (AKA, no length prefixing)?*
>>>> *What contexts are considered "unnested"? *"Nested" isn't a documented
>>>> property anywhere (except probably in the Python or Java code, which isn't
>>>> useful from a portability perspective), so it's not clear how SDK
>>>> developers are supposed to know what it means and does.
>>>>
>>>> Based on experience rather than on documentation in the proto spec [1]
>>>> or in standard_coders.yaml [2], there appears to be no portable
>>>> specification of which contexts are supposed to be nested and when,
>>>> which implies it's a holdover from pre-portability.
>>>>
>>>> But most importantly, *when is it actually used? *
>>>>
>>>> I understand there's a theoretical value in avoiding needing the length
>>>> ahead of time when encoding very large single elements, but is that
>>>> property ever taken advantage of anywhere?
>>>>
>>>> Currently, the Go SDK doesn't support unnested coders at all. All
>>>> :bytes:v1 and string_utf8:v1 coders are assumed to be length prefixed.
>>>> So values generated by the SDK will always be marked as LP for those
>>>> variable-length coders, and that's what the pipeline generates for them.
>>>> What's not clear to me is when an SDK should assume the bytes aren't
>>>> length prefixed, as that's not documented anywhere, nor is the intended
>>>> purpose of the distinction.
>>>>
>>>> I'm happy to go ahead and add such documentation to the protos or
>>>> standard_coders.yaml file for posterity, but I can't until I understand the
>>>> situation better. I'd like it to be documented so new SDK authors don't run
>>>> into the same confusions I have.
>>>>
>>>> Thanks, and Cheers.
>>>> Robert Burke
>>>>
>>>> [0] https://issues.apache.org/jira/browse/BEAM-7009
>>>> [1]
>>>> https://github.com/apache/beam/blob/a5b2046b10bebc59c5bde41d4cb6498058fdada2/model/pipeline/src/main/proto/beam_runner_api.proto#L672
>>>> [2]
>>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml#L18
>>>>
>>>
