Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Robert Bradshaw Fri, 05 Apr 2019 00:39:03 -0700

On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee <[email protected]> wrote:
>
> Robert, does nested/unnested context work properly for Java?


I believe so. It is similar to the bytes coder, that prefixes vs. not
based on the context.

> I can see that the Context is fixed to NESTED[1] and the encode method with 
> the Context parameter is marked as deprecated[2].
>
> [1]: 
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L68
> [2]: 
> https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/Coder.java#L132

That doesn't mean it's unused, e.g.

https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/util/CoderUtils.java#L160
https://github.com/apache/beam/blob/release-2.12.0/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/LengthPrefixCoder.java#L64

(and I'm sure there's others).

> On Thu, Apr 4, 2019 at 3:25 PM Robert Bradshaw <[email protected]> wrote:
>>
>> I don't know why there are two separate copies of
>> standard_coders.yaml--originally there was just one (though it did
>> live in the Python directory). I'm guessing a copy was made rather
>> than just pointing both to the new location, but that completely
>> defeats the point. I can't seem to access JIRA right now; could
>> someone file an issue to resolve this?
>>
>> I also think the spec should be next to the definition of the URN,
>> that's one of the reason the URNs were originally in a markdown file
>> (to encourage good documentation, literate programming style). Many
>> coders already have their specs there.
>>
>> Regarding backwards compatibility, we can't change existing coders,
>> and making new coders won't help with inference ('cause changing that
>> would also be backwards incompatible). Fortunately, I think we're
>> already doing the consistent thing here: In both Python and Java the
>> raw UTF-8 encoded bytes are encoded when used in an *unnested* context
>> and the length-prefixed UTF-8 encoded bytes are used when the coder is
>> used in a *nested* context.
>>
>> I'd really like to see the whole nested/unnested context go away, but
>> that'll probably require Beam 3.0; it causes way more confusion than
>> the couple of bytes it saves in a couple of places.
>>
>> - Robert
>>
>> On Thu, Apr 4, 2019 at 10:55 PM Robert Burke <[email protected]> wrote:
>> >
>> > My 2cents is that the "Textual description" should be part of the 
>> > documentation of the URNs on the Proto messages, since that's the common 
>> > place. I've added a short description for the varints for example, and we 
>> > already have lenghthier format & protocol descriptions there for iterables 
>> > and similar.
>> >
>> > The proto [1] *can be* the spec if we want it to be.
>> >
>> > [1]: 
>> > https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L557
>> >
>> > On Thu, 4 Apr 2019 at 13:51, Kenneth Knowles <[email protected]> wrote:
>> >>
>> >>
>> >>
>> >> On Thu, Apr 4, 2019 at 1:49 PM Robert Burke <[email protected]> wrote:
>> >>>
>> >>> We should probably move the "java" version of the yaml file [1] to a 
>> >>> common location rather than deep in the java hierarchy, or copying it 
>> >>> for Go and Python, but that can be a separate task. It's probably 
>> >>> non-trivial since it looks like it's part of a java resources structure.
>> >>
>> >>
>> >> Seems like /model is a good place for this if we don't want to invent a 
>> >> new language-independent hierarchy.
>> >>
>> >> Kenn
>> >>
>> >>
>> >>>
>> >>> Luke, the Go SDK doesn't currently do this validation, but it shouldn't 
>> >>> be difficult, given pointers to the Java and Python variants of the 
>> >>> tests to crib from [2]. Care would need to be taken so that Beam Go SDK 
>> >>> users (such as they are) aren't forced to run them, and not have the 
>> >>> yaml file to read. I'd suggest putting it with the integration tests [3].
>> >>>
>> >>> I've filed a JIRA (BEAM-7009) for tracking this Go SDK side work. [4]
>> >>>
>> >>> 1: 
>> >>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >>> 2: 
>> >>> https://github.com/apache/beam/search?q=standard_coders.yaml&unscoped_q=standard_coders.yaml
>> >>> 3: https://github.com/apache/beam/tree/master/sdks/go/test
>> >>> 4: https://issues.apache.org/jira/browse/BEAM-7009
>> >>>
>> >>>
>> >>> On Thu, 4 Apr 2019 at 13:28, Lukasz Cwik <[email protected]> wrote:
>> >>>>
>> >>>>
>> >>>>
>> >>>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath 
>> >>>> <[email protected]> wrote:
>> >>>>>
>> >>>>>
>> >>>>>
>> >>>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <[email protected]> wrote:
>> >>>>>>
>> >>>>>> standard_coders.yaml[1] is where we are currently defining these 
>> >>>>>> formats.
>> >>>>>> Unfortunately the Python SDK has its own copy[2].
>> >>>>>
>> >>>>>
>> >>>>> Ah great. Thanks for the pointer. Any idea why there's  a separate 
>> >>>>> copy for Python ? I didn't see a significant difference in definitions 
>> >>>>> looking at few random coders there but I might have missed something. 
>> >>>>> If there's no reason to maintain two, we should probably unify.
>> >>>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>> >>>>>
>> >>>>
>> >>>> Not certain as well. I did notice the timer coder definition didn't 
>> >>>> exist in the Python copy.
>> >>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as 
>> >>>>>> tests to the Java and Python SDKs to ensure interoperability.
>> >>>>>>
>> >>>>>> Robert Burke, does the Go SDK have a test where it uses 
>> >>>>>> standard_coders.yaml and runs compatibility tests?
>> >>>>>>
>> >>>>>> Chamikara, creating new coder classes is a pain since the type -> 
>> >>>>>> coder mapping per SDK language would select the non-well known type 
>> >>>>>> if we added a new one to a language. If we swapped the default 
>> >>>>>> type->coder mapping, this would still break update for pipelines 
>> >>>>>> forcing users to update their code to select the non-well known type. 
>> >>>>>> If we don't change the default type->coder mapping, the well known 
>> >>>>>> coder will gain little usage. I think we should fix the Python coder 
>> >>>>>> to use the same encoding as Java for UTF-8 strings before there are 
>> >>>>>> too many Python SDK users.
>> >>>>>
>> >>>>>
>> >>>>> I was thinking that may be we should just change the default UTF-8 
>> >>>>> coder for Fn API path which is experimental. Updating Python to do 
>> >>>>> what's done for Java is fine if we agree that encoding used for Java 
>> >>>>> should be the standard.
>> >>>>>
>> >>>>
>> >>>> That is a good idea to use the Fn API experiment to control which gets 
>> >>>> selected.
>> >>>>
>> >>>>>>
>> >>>>>>
>> >>>>>> 1: 
>> >>>>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>> >>>>>> 2: 
>> >>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>> >>>>>> 3: https://github.com/apache/beam/pull/8205
>> >>>>>>
>> >>>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath 
>> >>>>>> <[email protected]> wrote:
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw 
>> >>>>>>> <[email protected]> wrote:
>> >>>>>>>>
>> >>>>>>>> A URN defines the encoding.
>> >>>>>>>>
>> >>>>>>>> There are (unfortunately) *two* encodings defined for a Coder 
>> >>>>>>>> (defined
>> >>>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>> >>>>>>>> Python, the nested one prefixes with a var-int length, and the
>> >>>>>>>> unnested one does not.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> Could you clarify where we define the exact encoding ? I only see a 
>> >>>>>>> URN for UTF-8 [1] while if you look at the implementations Java 
>> >>>>>>> includes length in the encoding [1] while Python [1] does not.
>> >>>>>>>
>> >>>>>>> [1] 
>> >>>>>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>> >>>>>>> [2] 
>> >>>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>> >>>>>>> [3] 
>> >>>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>> >>>>>>>
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> We should define the spec clearly and have cross-language tests.
>> >>>>>>>
>> >>>>>>>
>> >>>>>>> +1
>> >>>>>>>
>> >>>>>>> Regarding backwards compatibility, I agree that we should probably 
>> >>>>>>> not update existing coder classes. Probably we should just 
>> >>>>>>> standardize the correct encoding (may be as a comment near 
>> >>>>>>> corresponding URN in the beam_runner_api.proto ?) and create new 
>> >>>>>>> coder classes as needed.
>> >>>>>>>
>> >>>>>>>>
>> >>>>>>>>
>> >>>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <[email protected]> 
>> >>>>>>>> wrote:
>> >>>>>>>> >
>> >>>>>>>> > Could this be a backwards-incompatible change that would break 
>> >>>>>>>> > pipelines from upgrading? If they have data in-flight in between 
>> >>>>>>>> > operators, and we change the coder, they would break?
>> >>>>>>>> > I know very little about coders, but since nobody has mentioned 
>> >>>>>>>> > it, I wanted to make sure we have it in mind.
>> >>>>>>>> > -P.
>> >>>>>>>> >
>> >>>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <[email protected]> 
>> >>>>>>>> > wrote:
>> >>>>>>>> >>
>> >>>>>>>> >> Agree that a coder URN defines the encoding. I see that string 
>> >>>>>>>> >> UTF-8 was added to the proto enum, but it needs a written spec 
>> >>>>>>>> >> of the encoding. Ideally some test data that different languages 
>> >>>>>>>> >> can use to drive compliance testing.
>> >>>>>>>> >>
>> >>>>>>>> >> Kenn
>> >>>>>>>> >>
>> >>>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <[email protected]> 
>> >>>>>>>> >> wrote:
>> >>>>>>>> >>>
>> >>>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in 
>> >>>>>>>> >>> the protos, but I don't think that developed beyond Java, so 
>> >>>>>>>> >>> adding it to Python would be reasonable in my opinion.
>> >>>>>>>> >>>
>> >>>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which 
>> >>>>>>>> >>> for Go are always length prefixed (and reported to the Runner 
>> >>>>>>>> >>> as LP+CustomCoder). It would be straight forward to add the 
>> >>>>>>>> >>> correct handling for strings, as Go natively treats strings as 
>> >>>>>>>> >>> UTF8.
>> >>>>>>>> >>>
>> >>>>>>>> >>>
>> >>>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <[email protected]> 
>> >>>>>>>> >>> wrote:
>> >>>>>>>> >>>>
>> >>>>>>>> >>>> Hi all,
>> >>>>>>>> >>>>
>> >>>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses 
>> >>>>>>>> >>>> different encoding schemes. StringUtf8Coder in Java SDK puts 
>> >>>>>>>> >>>> the varint length of the input string before actual data bytes 
>> >>>>>>>> >>>> however StrUtf8Coder in Python SDK directly encodes the input 
>> >>>>>>>> >>>> string to bytes value. For the last few weeks, I've been 
>> >>>>>>>> >>>> testing and fixing cross-language IO transforms and this 
>> >>>>>>>> >>>> discrepancy is a major blocker for me. IMO, we should unify 
>> >>>>>>>> >>>> the encoding schemes of UTF8 strings across the different SDKs 
>> >>>>>>>> >>>> and make it a standard coder. Any thoughts?
>> >>>>>>>> >>>>
>> >>>>>>>> >>>> Thanks,

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Reply via email to