Thanks for offering to help with the documentation (BEAM-3203). The
high-level documentation about this is here[1], and beam_runner_api.proto
also contains a fair amount of information about coders.

Length prefixing is not the same as the nesting coder property in
Java/Python. We have been trying to get rid of that property, but for
backwards compatibility it can only be marked deprecated. Internally, the
Python/Java SDKs also use coders to produce output for other things such as
files, which is why the unnested (aka outer) coder context exists. We
should not have used the coder concept both for formatting/parsing IO and
as an intermediate representation for state/PCollections/etc. All the
portable Beam APIs are defined to use the nested encoding, and any that
doesn't is a bug. I would suggest testing only the nested variants in
standard_coders.yaml and deleting the unnested variants of the tests (I'm
not certain why they exist in the first place).
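
To make the nested/unnested distinction concrete, here is a minimal
sketch for a bytes value, assuming the usual base-128 varint length
prefix. The function names are made up for illustration and are not the
SDK's API. In the nested context the value carries its own length, so
several values can share one stream; in the unnested (outer) context the
value is written raw and is assumed to run to the end of the stream.

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint, low 7 bits first."""
    if n < 0:
        raise ValueError("varint sketch only handles non-negative values")
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        if n:
            out.append(bits | 0x80)  # more bytes follow
        else:
            out.append(bits)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Decode a varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        b = data[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7


def encode_bytes(value: bytes, nested: bool) -> bytes:
    """Hypothetical helper: nested adds a length prefix, unnested does not."""
    if nested:
        return encode_varint(len(value)) + value
    return value


def decode_bytes(data: bytes, pos: int, nested: bool):
    """Hypothetical helper: return (value, next position)."""
    if nested:
        size, pos = decode_varint(data, pos)
        return data[pos:pos + size], pos + size
    # Unnested: the value is everything up to the end of the stream.
    return data[pos:], len(data)
```

With the nested form, encode_bytes(b"abc", True) followed by
encode_bytes(b"de", True) can be concatenated and decoded back one value
at a time, which is what the portable APIs rely on. The unnested form
only works when the value is the last thing in the stream.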

As for the wrapping issue you brought up: anywhere a coder is referenced by
a PTransform in the pipeline graph, the associated environment is expected
to understand all the component encodings. So if a LengthPrefix<String>
coder is in the graph, referenced by transform X with environment Y, then
environment Y is expected to know the encodings for LengthPrefix and String
and should be able to accept either LengthPrefix<String> or String as input
or output. This currently comes up on the data API (elements, timers), on
the state API (user state, side inputs), when providing the encoded element
during splitting, and during graph expansion for certain well-known
transforms such as combiners and splittable DoFns.
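
As a rough illustration of "the environment is expected to understand all
the component encodings", the check below walks a coder tree and reports
any URN the environment does not know. This is a toy sketch, not how any
runner or SDK actually implements it: CoderSpec and unknown_urns are
hypothetical names, and the tree is a simplified stand-in for the coder
protos.

```python
from dataclasses import dataclass, field
from typing import List, Set

LENGTH_PREFIX_URN = "beam:coder:length_prefix:v1"
STRING_UTF8_URN = "beam:coder:string_utf8:v1"


@dataclass
class CoderSpec:
    """Hypothetical, simplified stand-in for a coder and its components."""
    urn: str
    components: List["CoderSpec"] = field(default_factory=list)


def unknown_urns(coder: CoderSpec, known: Set[str]) -> List[str]:
    """Return every URN in the coder tree that is not in the known set."""
    missing = [] if coder.urn in known else [coder.urn]
    for component in coder.components:
        missing.extend(unknown_urns(component, known))
    return missing


# LengthPrefix<String> as referenced by some transform X.
lp_string = CoderSpec(LENGTH_PREFIX_URN, [CoderSpec(STRING_UTF8_URN)])
```

An environment that knows both URNs gets an empty list back for
lp_string; one that only knows the string coder would see the
length-prefix URN reported, which is the situation the rule above is
meant to exclude.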

1:
https://docs.google.com/document/d/1IGduUqmhWDi_69l9nG8kw73HZ5WI5wOps9Tshl5wpQA/edit#heading=h.g19rkupi8zga

On Fri, Jun 19, 2020 at 10:49 AM Robert Burke <r...@google.com> wrote:

> Hello dev@! I have questions.
>
> Context: I'm adding Beam Schemas to the Go SDK, and on the way I'm
> validating the Go SDK coders against standard_coders.yaml, per BEAM-7009[0].
>
> *When is it reasonable for a runner to send an SDK an unnested byte or
> string coder (AKA, no length prefixing)?*
> *What contexts are considered "unnested"? *"Nested" isn't a documented
> property anywhere (except probably in the Python or Java code, which isn't
> useful from a portability perspective), so it's not clear how SDK
> developers are supposed to know what it means and does.
>
> Based on experience, rather than documentation in the proto spec [1] or in
> standard_coders.yaml [2], there's no portable specification of what
> contexts are supposed to be nested and when, which implies it's a holdover
> from pre-portability.
>
> But most importantly, *when is it actually used? *
>
> I understand there's a theoretical value in avoiding needing the length
> ahead of time when encoding very large single elements, but is that
> property ever taken advantage of anywhere?
>
> Currently, the Go SDK doesn't support unnested coders at all. All
> :bytes:v1 and string_utf8:v1 coders are assumed to be length prefixed. So
> values generated by the SDK will always be marked as LP for those
> variable length coders, and that's what the pipeline generates for them.
> What's not clear to me is when an SDK should be assuming the bytes aren't
> length prefixed, as that's not documented anywhere, nor is the intended
> purpose of the distinction.
>
> I'm happy to go ahead and add such documentation to the protos or
> standard_coders.yaml file for posterity, but I can't until I understand the
> situation better. I'd like it to be documented so new SDK authors don't run
> into the same confusions I have.
>
> Thanks, and Cheers.
> Robert Burke
>
> [0] https://issues.apache.org/jira/browse/BEAM-7009
> [1]
> https://github.com/apache/beam/blob/a5b2046b10bebc59c5bde41d4cb6498058fdada2/model/pipeline/src/main/proto/beam_runner_api.proto#L672
> [2]
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml#L18
>
