Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Robert Burke Mon, 08 Apr 2019 16:40:06 -0700

On Mon, 8 Apr 2019 at 16:03, Robert Bradshaw <[email protected]> wrote:


> On Mon, Apr 8, 2019 at 8:04 PM Kenneth Knowles <[email protected]> wrote:
> >
> > On Mon, Apr 8, 2019 at 1:57 AM Robert Bradshaw <[email protected]>
> wrote:
> >>
> >> On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles <[email protected]>
> wrote:
> >> >
> >> > On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw <[email protected]>
> wrote:
> >> >>
> >> >> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles <[email protected]>
> wrote:
> >> >> >
> >> >> > Nested and unnested contexts are two different encodings. Can we
> just give them different URNs? We can even just express the length-prefixed
> UTF-8 as a composition of the length-prefix URN and the UTF-8 URN.
> >> >>
> >> >> It's not that simple, especially when it comes to composite
> encodings.
> >> >> E.g. for some coders, nested(C) == unnested(C), for some coders
> >> >> nested(C) == lenth_prefix(unnested(C)), and for other coders it's
> >> >> something else altogether (e.g. when creating a kv coder, the first
> >> >> component must use nested context, and the second inherits the nested
> >> >> vs. unnested context). When creating TupleCoder(A, B) one doesn't
> want
> >> >> to forcibly use LenthPrefixCoder(A) and LengthPrefixCoder(B), nor
> does
> >> >> one want to force LengthPrefixCoder(TupleCoder(A, B)) because A and B
> >> >> may themselves be large and incrementally written (e.g.
> >> >> IterableCoder). Using distinct URNs doesn't work well if the runner
> is
> >> >> free to compose and decompose tuple, iterable, etc. coders that it
> >> >> doesn't understand.
> >> >>
> >> >> Until we stop using Coders for IO (a worthy but probably lofty goal)
> >> >> we will continue to need the unnested context (lest we expect and
> >> >> produce length-prefixed coders in text files, as bigtable keys,
> etc.).
> >> >> On the other hand, almost all internal use is nested (due to sending
> >> >> elements around as part of element streams). The other place we cross
> >> >> over is LengthPrefixCoder that encodes its values using the unnested
> >> >> context prefixed by the unnested encoding length.
> >> >>
> >> >> Perhaps a step in the right direction would be to consistently use
> the
> >> >> unnested context everywhere but IOs (meaning when we talked about
> >> >> coders from the FnAPI perspective, they're *always* in the nested
> >> >> context, and hence always have the one and only encoding defined by
> >> >> that URN, including when wrapped by a length prefix coder (which
> would
> >> >> sometimes result in double length prefixing, but I think that's a
> >> >> price worth paying (or we could do something more clever like make
> >> >> length-prefix an (explicit) modifier on a coder rather than a new
> >> >> coder itself that would default to length prefixing (or some of the
> >> >> other delimiting schemes we've come up with) but allow the component
> >> >> coder to offer alternative length-prefix-compatible encodings))). IOs
> >> >> could be updated to take Parsers and Formatters (or whatever we call
> >> >> them) with the Coders in the constructors left as syntactic sugar
> >> >> until we could remove them in 3.0. As Luke says, we have a chance to
> >> >> fix our coders for portable pipelines now.
> >> >>
> >> >> In the (very) short term, we're stuck with a nested and unnested
> >> >> version of StringUtf8, just as we have for bytes, lest we change the
> >> >> meaning of (or disallow some of) TupleCoder[StrUtf8Coder, ...],
> >> >> LengthPrefixCoder[StrUtf8Coder], and using StringUtf8Coder for IO.
> >> >
> >> >
> >> > First, let's note that "nested" and "outer" are a misnomer. The
> distinction is whether it is the last thing encoded in the stream. In a
> KvCoder<KeyCoder, ValueCoder> the ValueCoder is actually encoded in the
> "outer" context though the value is nested. No doubt a good amount of
> confusion comes from the initial and continued use of this terminology.
> >>
> >> +1. I think of these as "self-delimiting" vs. "externally-delimited."
> >>
> >> > So, all that said, it is a simple fact that UTF-8 and length-prefixed
> UTF-8 are two different encodings. Encodings are the fundamental concept
> here and coders encapsulate two encodings, with some subtle and
> inconsistently-applied rules about when to use which encoding. I think we
> should still give them distinct URNs unless impossible. You've outlined
> some steps to clarify the situation.
> >>
> >> Currently, we have a one-to-two relationship between Coders and
> >> encodings, and a one-to-one relationship between URNs and encoders.
> >>
> >> To make the first one-to-one, we would either have to make
> >> StringUtf8Coder unsuitable for TextIO (letting it always prefix its
> >> contents with a length) or unsuitable for the key part of a KV, the
> >> element of an iterable, etc. (where the length is required).
> >
> > Definitely in favor of dropping TextIO from the conversation. Just need
> to think about compatibility, but I would save that for after the idealized
> end goal is defined, and hack it somehow. to the one-to-one vs one-to-two,
> can we have:
> >
> >  - SDK coder -> proto: be smart about knowing what the context is and
> uses the appropriate URN (this is when building the graph)
>
> So you'd have to day "give me the outer version of yourself" or "give
> me the inner version of yourself" which would (sometimes) get passed
> down to components. And this extra bit would also have to be set in
> the object -> proto caches. It's the same nested/outer logic, just put
> in a different place.
>
> The context here is almost always "inner" because even if it's the
> PCollection element's coder, it must be encoded in a self-delimiting
> way as per the (current) data-plane protocol. On the other hand, if
> you have a PCollection<KV> that's later used as a map side input, the
> K appears alone (potentiall outer) during lookups. Of course one can
> always use inner and not care about the redundancy.
>
> >  - proto -> SDK: instantiate a variant of the coder that is fixed to one
> context (this is when you get a process bundle instruction from the runner)
> >
> > I think this is more-or-less the idea you mean by...
> >
> >> Alternatively we could give Coders the ability to return the
> >> nested/unnested version of themselves, but this also gets messy
> >> because it depends on the ultimate outer context which we don't always
> >> have at hand (and leads to surprises, e.g. asking for the key coder of
> >> a KV coder may not return the same coder that was used to construct
> >> it).
>
> I was thinking of doing this at graph construction time, but at proto
> translation time could be better (though it complicates that).
>
> > ... and I don't think it would be too messy, because fixing the context
> would happen around submission time, yes? Basically, it keep "Java Coder"
> as a vestigial concept that compiles away, much like PValue/PInput/POutput.
>
> Unfortunately nested vs. unnested is already a model-level concept
> (e.g. Python does the same, and has been changed to match java
> exactly). Not that we couldn't change this for FnApi running.
>
> >> On the other hand, breaking the one-to-one relationship of Coders and
> >> URNs is also undesirable, because it forces a choice when serializing
> >> to protos (where one may not always know all the contexts it's used
> >> in)
> >
> > When would you not know the context? I posit that toProto should always
> know the context.
> >
> >> and makes the round-trip through protos non-idempotent (if the URN
> >> gets decoded back into a Coder that does not have this dual-defined
> >> encoding).
> >
> > This doesn't seem like such a problem. Of course, I would say that, as I
> am proposing to do just this :-)
>
> I think we'll have even more of this (happening even at pipeline
> construction time, e.g. for cross-language pipelines).
>
> >> Also, if a runner knows about special pairs of URNs that
> >> represent the nested/unnested encodings of the same type, and rules
> >> like knowing constraints on components of KVs, etc., this seems an
> >> even worse situation than we're already in.
> >
> > Perhaps even worse than this would be to enshrine the idea of
> self-delimiting/non-self-delimiting coders as a flag in the proto, so
> runners know to flip it. That is really bad, making the model inherit
> accidental complexity from the Java SDK. But is this flag always just a
> length prefix, in practice? If so, then the smarts to add to toProto would
> cause the self-delimiting variant of UTF-8 coder to be a composite
> LengthPrefix(UTF-8) coder. Then the model doesn't change and the runner
> does not need to be aware of special pairs of URNs.
>
> No, most of the "basic" coders (e.g. ints, doubles, iterables, ...)
> are already self-delimiting.
>
> >> > I think the most meaningful issue is runners manipulating coders. But
> the runner should be able to instruct the SDK (where the coder actually
> executes) about which encoding to use. I need to think through some
> examples of where the SDK tells the runner the encoding of something and
> those where the runner fabricates steps and instructs the SDK what encoding
> to use for elements. GroupByKey, GetKeys, etc, are examples where the
> context will change.
> >>
> >> Combiner lifting and side inputs are good candidates as well.
> >
> >
> > GBK example:
> >
> >  - Input: KvCoder[Outer]<KeyCoder[Inner], ValueCoder[Outer]>
> >  - Output: KvCoder[Outer]<KeyCoder[Inner],
> IterableCoder[Outer]<ValueCoder[Inner]>>
> >
> > In this case, it is the SDK providing both, so it can easily make the
> ValueCoder[Inner] a LengthPrefix(ValueCoder[Outer]). If the ValueCoder is
> already length-prefixed there is a cost. That could be detected and elided
> I think; might need additional methods but not if the ValueCoder toProto
> already encoded in a readable way.
> >
> > GBKVaiGBKO example, where the runner chooses a particular implementation
> strategy. Assuming the client language is not Java so inner/outer switching
> cannot be done unless explicitly represented in the proto.
> >
> >  - Input: KvCoder[Outer]<KeyCoder[Inner], ValueCoder[Outer]>
> >  - Intermediate: KvCoder[Outer]<KeyCoder[Inner],
> IterableCoder[Outer]<WindowedValueCoder[Outer]<ValueCoder[Inner]>>>
> >
> > Since the input ValueCoder is always Outer, it is easy to add a
> length-prefix without knowing what it is. So from these examples, I
> hypothesize that the hard case is when there is a user coder in Inner
> encoding and the runner wants to turn it into Outer encoding, but the Inner
> encoding does not explicitly indicate that it is a LengthPrefix(Outer)
> encoding. I believe that this is not a problem - the runner can just leave
> it as the Inner encoding since every Inner encoding is a safe Outer
> encoding (but not vice-versa). I don't know of a real example, but if there
> were a runner-inserted GetKeys then the KeyCoder[Inner] might *want* to
> change to KeyCoder[Outer] but that wouldn't be a requirement.
> >
> > For side inputs, PCollection<ValueCoder[Outer]> the runner may want to
> put a bunch of elements in some data structure that requires
> self-delimiting. Same logic, it is always easy to move from Outer to Inner.
>
> Moving an Iterable from Outer to Inner by adding a length-prefix is
> really expensive, but useless as Outer is already length-delimited.
> Moving a KV coder form Outer to Inner involves changing the second
> component. I don't think we want to bake special rules like this into
> the model.
>
>
> This email is already very long, but in summary I think the right
> answer is to just get rid of Outer altogether (except possibly for
> IOs, which we'd only preserve for legacy reasons until 3.0). The
> question remains whether we should allow a Coder C (as part of its
> spec, known to all that know the URN of C) to declare an alternative
> coder C' for LengthPrefix(C), that is different than e ->
> varLen(size(encode(C, e))) + encode(C, e), possibly (often?) C' == C,
> to avoid double-length-prefixing.
>

Technically this is how the Go SDK treats all "custom coders" which in this
case means, anything that's not defined by the beam spec.
Such coders are effectively C', the user side is provided an element to
encode, and is expected to produce bytes (C), which the system
automatically Length Prefixes, and indicates to the runner the bytes are
LengthPrefix(C).
On decode, when it sees LengthPrefix(C), it peels off the length prefix N,
and then provides the user coder with the length bytes N.

The real trick is that there's no dedicated LengthPrefix coder in the SDK,
it's simply a part of the CustomCoder implementation which wraps all user
coders. The way I see it, the LengthPrefix for user elements is for the
Runner's benefit, not the User's, and having the runner "own" that part of
the encoding avoids weird issues where it might get striped unintentionally

There's a more trusting approach where a user coder is handed the stream
directly (io.Reader in Go's instance), and manages it's own in memory byte
buffers, and reads from the stream, but, in the SDK's present state, there
hasn't been a requirement for that level of efficiency yet, in any of the
Go SDK profiling benchmarks I've looked at. I'm largely sending around
protocol buffers, rather than Ints, floats or other ~1-2 bytewords where a
LP represents higher overhead.

Robert Burke (the other Robert)

>
> - Robert
>
>
> >> >> > On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw <
> [email protected]> wrote:
> >> >> >>
> >> >> >> On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <[email protected]>
> wrote:
> >> >> >> >
> >> >> >> > Robert, does nested/unnested context work properly for Java?
> >> >> >>
> >> >> >> I believe so. It is similar to the bytes coder, that prefixes vs.
> not
> >> >> >> based on the context.
> >> >> >>
> >> >> >> > I can see that the Context is fixed to NESTED[1] and the encode
> method with the Context parameter is marked as deprecated[2].
> >> >> >> >
> >> >> >> > [1]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
> >> >> >> > [2]:
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132
> >> >> >>
> >> >> >> That doesn't mean it's unused, e.g.
> >> >> >>
> >> >> >>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
> >> >> >>
> https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64
> >> >> >>
> >> >> >> (and I'm sure there's others).
> >> >> >>
> >> >> >> > On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <
> [email protected]> wrote:
> >> >> >> >>
> >> >> >> >> I don't know why there are two separate copies of
> >> >> >> >> standard_coders.yaml--originally there was just one (though it
> did
> >> >> >> >> live in the Python directory). I'm guessing a copy was made
> rather
> >> >> >> >> than just pointing both to the new location, but that
> completely
> >> >> >> >> defeats the point. I can't seem to access JIRA right now; could
> >> >> >> >> someone file an issue to resolve this?
> >> >> >> >>
> >> >> >> >> I also think the spec should be next to the definition of the
> URN,
> >> >> >> >> that's one of the reason the URNs were originally in a
> markdown file
> >> >> >> >> (to encourage good documentation, literate programming style).
> Many
> >> >> >> >> coders already have their specs there.
> >> >> >> >>
> >> >> >> >> Regarding backwards compatibility, we can't change existing
> coders,
> >> >> >> >> and making new coders won't help with inference ('cause
> changing that
> >> >> >> >> would also be backwards incompatible). Fortunately, I think
> we're
> >> >> >> >> already doing the consistent thing here: In both Python and
> Java the
> >> >> >> >> raw UTF-8 encoded bytes are encoded when used in an *unnested*
> context
> >> >> >> >> and the length-prefixed UTF-8 encoded bytes are used when the
> coder is
> >> >> >> >> used in a *nested* context.
> >> >> >> >>
> >> >> >> >> I'd really like to see the whole nested/unnested context go
> away, but
> >> >> >> >> that'll probably require Beam 3.0; it causes way more
> confusion than
> >> >> >> >> the couple of bytes it saves in a couple of places.
> >> >> >> >>
> >> >> >> >> - Robert
> >> >> >> >>
> >> >> >> >> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <
> [email protected]> wrote:
> >> >> >> >> >
> >> >> >> >> > My 2cents is that the "Textual description" should be part
> of the documentation of the URNs on the Proto messages, since that's the
> common place. I've added a short description for the varints for example,
> and we already have lenghthier format & protocol descriptions there for
> iterables and similar.
> >> >> >> >> >
> >> >> >> >> > The proto [1] *can be* the spec if we want it to be.
> >> >> >> >> >
> >> >> >> >> > [1]:
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
> >> >> >> >> >
> >> >> >> >> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <
> [email protected]> wrote:
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <
> [email protected]> wrote:
> >> >> >> >> >>>
> >> >> >> >> >>> We should probably move the "java" version of the yaml
> file [1] to a common location rather than deep in the java hierarchy, or
> copying it for Go and Python, but that can be a separate task. It's
> probably non-trivial since it looks like it's part of a java resources
> structure.
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >> Seems like /model is a good place for this if we don't want
> to invent a new language-independent hierarchy.
> >> >> >> >> >>
> >> >> >> >> >> Kenn
> >> >> >> >> >>
> >> >> >> >> >>
> >> >> >> >> >>>
> >> >> >> >> >>> Luke, the Go SDK doesn't currently do this validation, but
> it shouldn't be difficult, given pointers to the Java and Python variants
> of the tests to crib from [2]. Care would need to be taken so that Beam Go
> SDK users (such as they are) aren't forced to run them, and not have the
> yaml file to read. I'd suggest putting it with the integration tests [3].
> >> >> >> >> >>>
> >> >> >> >> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK
> side work. [4]
> >> >> >> >> >>>
> >> >> >> >> >>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >> >> >> >>> 2:
> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
> >> >> >> >> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
> >> >> >> >> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
> >> >> >> >> >>>
> >> >> >> >> >>>
> >> >> >> >> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <[email protected]>
> wrote:
> >> >> >> >> >>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <
> [email protected]> wrote:
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <
> [email protected]> wrote:
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> standard_coders.yaml[1] is where we are currently
> defining these formats.
> >> >> >> >> >>>>>> Unfortunately the Python SDK has its own copy[2].
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>> Ah great. Thanks for the pointer. Any idea why there's
> a separate copy for Python ? I didn't see a significant difference in
> definitions looking at few random coders there but I might have missed
> something. If there's no reason to maintain two, we should probably unify.
> >> >> >> >> >>>>> Also, seems like we haven't added the definition for
> UTF-8 coder yet.
> >> >> >> >> >>>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>> Not certain as well. I did notice the timer coder
> definition didn't exist in the Python copy.
> >> >> >> >> >>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> Here is an example PR[3] that adds the
> "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure
> interoperability.
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> Robert Burke, does the Go SDK have a test where it uses
> standard_coders.yaml and runs compatibility tests?
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> Chamikara, creating new coder classes is a pain since
> the type -> coder mapping per SDK language would select the non-well known
> type if we added a new one to a language. If we swapped the default
> type->coder mapping, this would still break update for pipelines forcing
> users to update their code to select the non-well known type. If we don't
> change the default type->coder mapping, the well known coder will gain
> little usage. I think we should fix the Python coder to use the same
> encoding as Java for UTF-8 strings before there are too many Python SDK
> users.
> >> >> >> >> >>>>>
> >> >> >> >> >>>>>
> >> >> >> >> >>>>> I was thinking that may be we should just change the
> default UTF-8 coder for Fn API path which is experimental. Updating Python
> to do what's done for Java is fine if we agree that encoding used for Java
> should be the standard.
> >> >> >> >> >>>>>
> >> >> >> >> >>>>
> >> >> >> >> >>>> That is a good idea to use the Fn API experiment to
> control which gets selected.
> >> >> >> >> >>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> 1:
> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
> >> >> >> >> >>>>>> 2:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
> >> >> >> >> >>>>>> 3: https://github.com/apache/beam/pull/8205
> >> >> >> >> >>>>>>
> >> >> >> >> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
> [email protected]> wrote:
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <
> [email protected]> wrote:
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> A URN defines the encoding.
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> There are (unfortunately) *two* encodings defined for
> a Coder (defined
> >> >> >> >> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in
> both Java and
> >> >> >> >> >>>>>>>> Python, the nested one prefixes with a var-int
> length, and the
> >> >> >> >> >>>>>>>> unnested one does not.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> Could you clarify where we define the exact encoding ?
> I only see a URN for UTF-8 [1] while if you look at the implementations
> Java includes length in the encoding [1] while Python [1] does not.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> [1]
> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
> >> >> >> >> >>>>>>> [2]
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
> >> >> >> >> >>>>>>> [3]
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> We should define the spec clearly and have
> cross-language tests.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> +1
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>> Regarding backwards compatibility, I agree that we
> should probably not update existing coder classes. Probably we should just
> standardize the correct encoding (may be as a comment near corresponding
> URN in the beam_runner_api.proto ?) and create new coder classes as needed.
> >> >> >> >> >>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>>
> >> >> >> >> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <
> [email protected]> wrote:
> >> >> >> >> >>>>>>>> >
> >> >> >> >> >>>>>>>> > Could this be a backwards-incompatible change that
> would break pipelines from upgrading? If they have data in-flight in
> between operators, and we change the coder, they would break?
> >> >> >> >> >>>>>>>> > I know very little about coders, but since nobody
> has mentioned it, I wanted to make sure we have it in mind.
> >> >> >> >> >>>>>>>> > -P.
> >> >> >> >> >>>>>>>> >
> >> >> >> >> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <
> [email protected]> wrote:
> >> >> >> >> >>>>>>>> >>
> >> >> >> >> >>>>>>>> >> Agree that a coder URN defines the encoding. I see
> that string UTF-8 was added to the proto enum, but it needs a written spec
> of the encoding. Ideally some test data that different languages can use to
> drive compliance testing.
> >> >> >> >> >>>>>>>> >>
> >> >> >> >> >>>>>>>> >> Kenn
> >> >> >> >> >>>>>>>> >>
> >> >> >> >> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <
> [email protected]> wrote:
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>> String UTF8 was recently added as a "standard
> coder " URN in the protos, but I don't think that developed beyond Java, so
> adding it to Python would be reasonable in my opinion.
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>> The Go SDK handles Strings as "custom coders"
> presently which for Go are always length prefixed (and reported to the
> Runner as LP+CustomCoder). It would be straight forward to add the correct
> handling for strings, as Go natively treats strings as UTF8.
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>>
> >> >> >> >> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <
> [email protected]> wrote:
> >> >> >> >> >>>>>>>> >>>>
> >> >> >> >> >>>>>>>> >>>> Hi all,
> >> >> >> >> >>>>>>>> >>>>
> >> >> >> >> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and
> Python SDKs uses different encoding schemes. StringUtf8Coder in Java SDK
> puts the varint length of the input string before actual data bytes however
> StrUtf8Coder in Python SDK directly encodes the input string to bytes
> value. For the last few weeks, I've been testing and fixing cross-language
> IO transforms and this discrepancy is a major blocker for me. IMO, we
> should unify the encoding schemes of UTF8 strings across the different SDKs
> and make it a standard coder. Any thoughts?
> >> >> >> >> >>>>>>>> >>>>
> >> >> >> >> >>>>>>>> >>>> Thanks,
>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Reply via email to