On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles <k...@apache.org> wrote: > > > > On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw <rober...@google.com> wrote: >> >> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles <k...@apache.org> wrote: >> > >> > Nested and unnested contexts are two different encodings. Can we just give >> > them different URNs? We can even just express the length-prefixed UTF-8 as >> > a composition of the length-prefix URN and the UTF-8 URN. >> >> It's not that simple, especially when it comes to composite encodings. >> E.g. for some coders, nested(C) == unnested(C), for some coders >> nested(C) == length_prefix(unnested(C)), and for other coders it's >> something else altogether (e.g. when creating a kv coder, the first >> component must use nested context, and the second inherits the nested >> vs. unnested context). When creating TupleCoder(A, B) one doesn't want >> to forcibly use LengthPrefixCoder(A) and LengthPrefixCoder(B), nor does >> one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B >> may themselves be large and incrementally written (e.g. >> IterableCoder). Using distinct URNs doesn't work well if the runner is >> free to compose and decompose tuple, iterable, etc. coders that it >> doesn't understand. >> >> Until we stop using Coders for IO (a worthy but probably lofty goal) >> we will continue to need the unnested context (lest we expect and >> produce length-prefixed encodings in text files, as bigtable keys, etc.). >> On the other hand, almost all internal use is nested (due to sending >> elements around as part of element streams). The other place we cross >> over is LengthPrefixCoder, which encodes its values using the unnested >> context prefixed by the unnested encoding length. >> >> Perhaps a step in the right direction would be to consistently use the >> nested context everywhere but IOs (meaning when we talk about >> coders from the FnAPI perspective, they're *always* in the nested >> context, and hence always have the one and only encoding defined by >> that URN, including when wrapped by a length prefix coder (which would >> sometimes result in double length prefixing, but I think that's a >> price worth paying (or we could do something more clever like make >> length-prefix an (explicit) modifier on a coder rather than a new >> coder itself that would default to length prefixing (or some of the >> other delimiting schemes we've come up with) but allow the component >> coder to offer alternative length-prefix-compatible encodings))). IOs >> could be updated to take Parsers and Formatters (or whatever we call >> them) with the Coders in the constructors left as syntactic sugar >> until we could remove them in 3.0. As Luke says, we have a chance to >> fix our coders for portable pipelines now. >> >> In the (very) short term, we're stuck with nested and unnested >> versions of StringUtf8, just as we have for bytes, lest we change the >> meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...], >> LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO. > > > First, let's note that "nested" and "outer" are a misnomer. The distinction > is whether it is the last thing encoded in the stream. In a KvCoder<KeyCoder, > ValueCoder> the ValueCoder is actually encoded in the "outer" context though > the value is nested. No doubt a good amount of confusion comes from the > initial and continued use of this terminology.
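To make the two contexts concrete, here is a minimal Python sketch (illustrative only, not the SDK implementation; it assumes a base-128 varint for the length prefix, and the helper names are made up) showing how a single UTF-8 coder ends up with two byte formats depending on whether anything follows the value in the stream:

    # Illustrative sketch only -- not SDK code. One coder, two encodings,
    # chosen by whether anything follows the value in the stream.

    def encode_varint(n: int) -> bytes:
        """Base-128 varint: 7 bits per byte, high bit set on all but the last byte."""
        out = bytearray()
        while True:
            bits = n & 0x7F
            n >>= 7
            if n:
                out.append(bits | 0x80)
            else:
                out.append(bits)
                return bytes(out)

    def encode_utf8(value: str, nested: bool) -> bytes:
        data = value.encode("utf-8")
        if nested:
            # Nested context: something follows, so the encoding must be
            # self-delimiting -- prefix the bytes with their varint length.
            return encode_varint(len(data)) + data
        # Unnested ("outer") context: the value is the last thing in the
        # stream, so the end of the stream delimits it; no prefix needed.
        return data

    assert encode_utf8("abc", nested=False) == b"abc"
    assert encode_utf8("abc", nested=True) == b"\x03abc"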
+1. I think of these as "self-delimiting" vs. "externally-delimited." > So, all that said, it is a simple fact that UTF-8 and length-prefixed UTF-8 > are two different encodings. Encodings are the fundamental concept here and > coders encapsulate two encodings, with some subtle and inconsistently-applied > rules about when to use which encoding. I think we should still give them > distinct URNs unless impossible. You've outlined some steps to clarify the > situation. Currently, we have a one-to-two relationship between Coders and encodings, and a one-to-one relationship between URNs and Coders. To make the first one-to-one, we would either have to make StringUtf8Coder unsuitable for TextIO (letting it always prefix its contents with a length) or unsuitable for the key part of a KV, the element of an iterable, etc. (where the length is required). Alternatively we could give Coders the ability to return the nested/unnested version of themselves, but this also gets messy because it depends on the ultimate outer context, which we don't always have at hand (and leads to surprises, e.g. asking for the key coder of a KV coder may not return the same coder that was used to construct it). On the other hand, breaking the one-to-one relationship of Coders and URNs is also undesirable, because it forces a choice when serializing to protos (where one may not always know all the contexts it's used in) and makes the round-trip through protos non-idempotent (if the URN gets decoded back into a Coder that does not have this dual-defined encoding). Also, if a runner knows about special pairs of URNs that represent the nested/unnested encodings of the same type, plus rules like the constraints on components of KVs, etc., this seems an even worse situation than we're already in. > I think the most meaningful issue is runners manipulating coders. But the > runner should be able to instruct the SDK (where the coder actually executes) > about which encoding to use. I need to think through some examples of where > the SDK tells the runner the encoding of something and those where the runner > fabricates steps and instructs the SDK what encoding to use for elements. > GroupByKey, GetKeys, etc., are examples where the context will change. Combiner lifting and side inputs are good candidates as well. >> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <rober...@google.com> >> > wrote: >> >> >> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <heej...@google.com> wrote: >> >> > >> >> > Robert, does nested/unnested context work properly for Java? >> >> >> >> I believe so. It is similar to the bytes coder, which prefixes vs. not >> >> based on the context. >> >> >> >> > I can see that the Context is fixed to NESTED[1] and the encode method >> >> > with the Context parameter is marked as deprecated[2]. >> >> > >> >> > [1]: >> >> > https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68 >> >> > [2]: >> >> > https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132 >> >> >> >> That doesn't mean it's unused, e.g. >> >> >> >> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160 >> >> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64 >> >> >> >> (and I'm sure there's others). 
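Returning to the KV point above: building on the earlier sketch (again illustrative only, with the same made-up helpers), the key is never the last thing in the stream, so it always takes the self-delimiting form, while the value simply inherits whatever context the KV itself is in.

    # Sketch: composing a KV encoding from two UTF-8 components. The key is
    # always followed by the value, so it is always length-prefixed; the
    # value, being the last component, inherits the KV's own context.

    def encode_kv_utf8(key: str, value: str, nested: bool) -> bytes:
        return encode_utf8(key, nested=True) + encode_utf8(value, nested=nested)

    # KV is the last thing in the stream: the value needs no prefix.
    assert encode_kv_utf8("k", "hello", nested=False) == b"\x01k" + b"hello"
    # KV sits inside an element stream: the value picks up a length prefix
    # too, while the key's encoding is unchanged.
    assert encode_kv_utf8("k", "hello", nested=True) == b"\x01k" + b"\x05hello"

This is exactly why asking for "the" encoding of a coder is ambiguous: the answer depends on where the coder sits in the composition.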
>> >> >> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <rober...@google.com> >> >> > wrote: >> >> >> >> >> >> I don't know why there are two separate copies of >> >> >> standard_coders.yaml--originally there was just one (though it did >> >> >> live in the Python directory). I'm guessing a copy was made rather >> >> >> than just pointing both to the new location, but that completely >> >> >> defeats the point. I can't seem to access JIRA right now; could >> >> >> someone file an issue to resolve this? >> >> >> >> >> >> I also think the spec should be next to the definition of the URN; >> >> >> that's one of the reasons the URNs were originally in a markdown file >> >> >> (to encourage good documentation, literate programming style). Many >> >> >> coders already have their specs there. >> >> >> >> >> >> Regarding backwards compatibility, we can't change existing coders, >> >> >> and making new coders won't help with inference ('cause changing that >> >> >> would also be backwards incompatible). Fortunately, I think we're >> >> >> already doing the consistent thing here: In both Python and Java the >> >> >> raw UTF-8 encoded bytes are encoded when used in an *unnested* context >> >> >> and the length-prefixed UTF-8 encoded bytes are used when the coder is >> >> >> used in a *nested* context. >> >> >> >> >> >> I'd really like to see the whole nested/unnested context go away, but >> >> >> that'll probably require Beam 3.0; it causes way more confusion than >> >> >> the couple of bytes it saves in a couple of places. >> >> >> >> >> >> - Robert >> >> >> >> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <rob...@frantil.com> >> >> >> wrote: >> >> >> > >> >> >> > My 2 cents is that the "Textual description" should be part of the >> >> >> > documentation of the URNs on the Proto messages, since that's the >> >> >> > common place. I've added a short description for the varints for >> >> >> > example, and we already have lengthier format & protocol >> >> >> > descriptions there for iterables and similar. >> >> >> > >> >> >> > The proto [1] *can be* the spec if we want it to be. >> >> >> > >> >> >> > [1]: >> >> >> > https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557 >> >> >> > >> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <k...@apache.org> wrote: >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <rob...@frantil.com> >> >> >> >> wrote: >> >> >> >>> >> >> >> >>> We should probably move the "java" version of the yaml file [1] to >> >> >> >>> a common location rather than deep in the java hierarchy, or >> >> >> >>> copying it for Go and Python, but that can be a separate task. >> >> >> >>> It's probably non-trivial since it looks like it's part of a java >> >> >> >>> resources structure. >> >> >> >> >> >> >> >> >> >> >> >> Seems like /model is a good place for this if we don't want to >> >> >> >> invent a new language-independent hierarchy. >> >> >> >> >> >> >> >> Kenn >> >> >> >> >> >> >> >> >> >> >> >>> >> >> >> >>> Luke, the Go SDK doesn't currently do this validation, but it >> >> >> >>> shouldn't be difficult, given pointers to the Java and Python >> >> >> >>> variants of the tests to crib from [2]. Care would need to be >> >> >> >>> taken so that Beam Go SDK users (such as they are) aren't forced >> >> >> >>> to run them, since they won't have the yaml file to read. I'd suggest >> >> >> >>> putting it with the integration tests [3]. 
>> >> >> >>> >> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. >> >> >> >>> [4] >> >> >> >>> >> >> >> >>> 1: >> >> >> >>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml >> >> >> >>> 2: >> >> >> >>> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml >> >> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test >> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009 >> >> >> >>> >> >> >> >>> >> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <lc...@google.com> wrote: >> >> >> >>>> >> >> >> >>>> >> >> >> >>>> >> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath >> >> >> >>>> <chamik...@google.com> wrote: >> >> >> >>>>> >> >> >> >>>>> >> >> >> >>>>> >> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <lc...@google.com> >> >> >> >>>>> wrote: >> >> >> >>>>>> >> >> >> >>>>>> standard_coders.yaml[1] is where we are currently defining >> >> >> >>>>>> these formats. >> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2]. >> >> >> >>>>> >> >> >> >>>>> >> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's a >> >> >> >>>>> separate copy for Python? I didn't see a significant difference >> >> >> >>>>> in definitions looking at a few random coders there, but I might >> >> >> >>>>> have missed something. If there's no reason to maintain two, we >> >> >> >>>>> should probably unify. >> >> >> >>>>> Also, it seems like we haven't added the definition for the UTF-8 coder >> >> >> >>>>> yet. >> >> >> >>>>> >> >> >> >>>> >> >> >> >>>> Not certain either. I did notice the timer coder definition >> >> >> >>>> didn't exist in the Python copy. >> >> >> >>>> >> >> >> >>>>>> >> >> >> >>>>>> >> >> >> >>>>>> Here is an example PR[3] that adds tests for "beam:coder:double:v1" >> >> >> >>>>>> to the Java and Python SDKs to ensure interoperability. >> >> >> >>>>>> >> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses >> >> >> >>>>>> standard_coders.yaml and runs compatibility tests? >> >> >> >>>>>> >> >> >> >>>>>> Chamikara, creating new coder classes is a pain since the type >> >> >> >>>>>> -> coder mapping per SDK language would select the non-well-known >> >> >> >>>>>> type if we added a new one to a language. If we swapped >> >> >> >>>>>> the default type->coder mapping, this would still break update >> >> >> >>>>>> for pipelines, forcing users to update their code to select the >> >> >> >>>>>> non-well-known type. If we don't change the default type->coder >> >> >> >>>>>> mapping, the well-known coder will gain little usage. I think >> >> >> >>>>>> we should fix the Python coder to use the same encoding as Java >> >> >> >>>>>> for UTF-8 strings before there are too many Python SDK users. >> >> >> >>>>> >> >> >> >>>>> >> >> >> >>>>> I was thinking that maybe we should just change the default >> >> >> >>>>> UTF-8 coder for the Fn API path, which is experimental. Updating >> >> >> >>>>> Python to do what's done for Java is fine if we agree that the >> >> >> >>>>> encoding used for Java should be the standard. >> >> >> >>>>> >> >> >> >>>> >> >> >> >>>> That is a good idea to use the Fn API experiment to control which >> >> >> >>>> gets selected. 
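For anyone wiring this up in another SDK, the kind of compatibility check being discussed can be sketched roughly as follows (Python, illustrative only: the schema shown here is simplified and the field names are assumptions, so consult the real standard_coders.yaml and the existing Java/Python tests for the actual format; encode_utf8 is the toy helper from earlier in the thread):

    # Rough sketch of a standard_coders.yaml-driven check, not the real test.
    import yaml  # assumes PyYAML is available

    # Hypothetical registry mapping a coder URN to an encode(value, nested)
    # function; only the toy UTF-8 encoder from the earlier sketch is wired in.
    ENCODERS = {"beam:coder:string_utf8:v1": encode_utf8}

    def check_standard_coders(path: str) -> None:
        """Verify this SDK reproduces the byte sequences in the shared fixtures."""
        with open(path) as f:
            for spec in yaml.safe_load_all(f):  # one YAML document per coder case
                encode = ENCODERS.get(spec["coder"]["urn"])
                if encode is None:
                    continue  # coder not modelled in this SDK yet
                nested = spec.get("nested", False)  # assumed field, for illustration
                for expected, value in spec["examples"].items():
                    # Assumed convention: expected bytes stored as latin-1 text.
                    assert encode(value, nested) == expected.encode("latin-1")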
>> >> >> >>>> >> >> >> >>>>>> >> >> >> >>>>>> >> >> >> >>>>>> 1: >> >> >> >>>>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml >> >> >> >>>>>> 2: >> >> >> >>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml >> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205 >> >> >> >>>>>> >> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath >> >> >> >>>>>> <chamik...@google.com> wrote: >> >> >> >>>>>>> >> >> >> >>>>>>> >> >> >> >>>>>>> >> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw >> >> >> >>>>>>> <rober...@google.com> wrote: >> >> >> >>>>>>>> >> >> >> >>>>>>>> A URN defines the encoding. >> >> >> >>>>>>>> >> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder >> >> >> >>>>>>>> (defined >> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both >> >> >> >>>>>>>> Java and >> >> >> >>>>>>>> Python, the nested one prefixes with a var-int length, and the >> >> >> >>>>>>>> unnested one does not. >> >> >> >>>>>>> >> >> >> >>>>>>> >> >> >> >>>>>>> Could you clarify where we define the exact encoding? I only >> >> >> >>>>>>> see a URN for UTF-8 [1], while if you look at the >> >> >> >>>>>>> implementations, Java includes the length in the encoding [2] while >> >> >> >>>>>>> Python [3] does not. >> >> >> >>>>>>> >> >> >> >>>>>>> [1] >> >> >> >>>>>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563 >> >> >> >>>>>>> [2] >> >> >> >>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50 >> >> >> >>>>>>> [3] >> >> >> >>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321 >> >> >> >>>>>>> >> >> >> >>>>>>> >> >> >> >>>>>>>> >> >> >> >>>>>>>> >> >> >> >>>>>>>> We should define the spec clearly and have cross-language >> >> >> >>>>>>>> tests. >> >> >> >>>>>>> >> >> >> >>>>>>> >> >> >> >>>>>>> +1 >> >> >> >>>>>>> >> >> >> >>>>>>> Regarding backwards compatibility, I agree that we should >> >> >> >>>>>>> probably not update existing coder classes. Probably we should >> >> >> >>>>>>> just standardize the correct encoding (maybe as a comment >> >> >> >>>>>>> near the corresponding URN in beam_runner_api.proto?) and >> >> >> >>>>>>> create new coder classes as needed. >> >> >> >>>>>>> >> >> >> >>>>>>>> >> >> >> >>>>>>>> >> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada >> >> >> >>>>>>>> <pabl...@google.com> wrote: >> >> >> >>>>>>>> > >> >> >> >>>>>>>> > Could this be a backwards-incompatible change that would >> >> >> >>>>>>>> > prevent pipelines from upgrading? If they have data in-flight >> >> >> >>>>>>>> > in between operators, and we change the coder, they would >> >> >> >>>>>>>> > break? >> >> >> >>>>>>>> > I know very little about coders, but since nobody has >> >> >> >>>>>>>> > mentioned it, I wanted to make sure we have it in mind. >> >> >> >>>>>>>> > -P. >> >> >> >>>>>>>> > >> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles >> >> >> >>>>>>>> > <k...@apache.org> wrote: >> >> >> >>>>>>>> >> >> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that >> >> >> >>>>>>>> >> string UTF-8 was added to the proto enum, but it needs a >> >> >> >>>>>>>> >> written spec of the encoding. 
Ideally some test data that >> >> >> >>>>>>>> >> different languages can use to drive compliance testing. >> >> >> >>>>>>>> >> >> >> >> >>>>>>>> >> Kenn >> >> >> >>>>>>>> >> >> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke >> >> >> >>>>>>>> >> <rob...@frantil.com> wrote: >> >> >> >>>>>>>> >>> >> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard coder" URN >> >> >> >>>>>>>> >>> in the protos, but I don't think that developed beyond >> >> >> >>>>>>>> >>> Java, so adding it to Python would be reasonable in my >> >> >> >>>>>>>> >>> opinion. >> >> >> >>>>>>>> >>> >> >> >> >>>>>>>> >>> The Go SDK handles strings as "custom coders" presently, >> >> >> >>>>>>>> >>> which for Go are always length prefixed (and reported to >> >> >> >>>>>>>> >>> the Runner as LP+CustomCoder). It would be straightforward >> >> >> >>>>>>>> >>> to add the correct handling for strings, as Go >> >> >> >>>>>>>> >>> natively treats strings as UTF8. >> >> >> >>>>>>>> >>> >> >> >> >>>>>>>> >>> >> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee >> >> >> >>>>>>>> >>> <heej...@google.com> wrote: >> >> >> >>>>>>>> >>>> >> >> >> >>>>>>>> >>>> Hi all, >> >> >> >>>>>>>> >>>> >> >> >> >>>>>>>> >>>> It looks like the UTF-8 string coders in the Java and Python SDKs >> >> >> >>>>>>>> >>>> use different encoding schemes. StringUtf8Coder in the Java >> >> >> >>>>>>>> >>>> SDK puts the varint length of the input string before >> >> >> >>>>>>>> >>>> the actual data bytes; however, StrUtf8Coder in the Python SDK >> >> >> >>>>>>>> >>>> directly encodes the input string to a bytes value. For >> >> >> >>>>>>>> >>>> the last few weeks, I've been testing and fixing >> >> >> >>>>>>>> >>>> cross-language IO transforms, and this discrepancy is a >> >> >> >>>>>>>> >>>> major blocker for me. IMO, we should unify the encoding >> >> >> >>>>>>>> >>>> schemes of UTF8 strings across the different SDKs and >> >> >> >>>>>>>> >>>> make it a standard coder. Any thoughts? >> >> >> >>>>>>>> >>>> >> >> >> >>>>>>>> >>>> Thanks,
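To make the discrepancy Heejong describes concrete, here are the two byte formats side by side (a sketch, not SDK output; the sample value and variable names are just for illustration): the length-prefixed form puts the varint byte length of the UTF-8 data first, the raw form is the UTF-8 bytes alone.

    # The string "beam" encoded both ways (UTF-8 bytes: 62 65 61 6d).
    length_prefixed = b"\x04" + "beam".encode("utf-8")  # varint length 4, then the bytes
    raw_utf8 = "beam".encode("utf-8")                   # just the bytes, no prefix

    assert length_prefixed == b"\x04beam"
    assert raw_utf8 == b"beam"
    # A consumer expecting one form but handed the other either reads garbage
    # into the string or loses track of where the next field starts -- hence
    # the cross-language blocker described above.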