Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Kenneth Knowles Thu, 04 Apr 2019 13:50:34 -0700

On Thu, Apr 4, 2019 at 1:48 PM Kenneth Knowles <[email protected]> wrote:


> I have to actually say that a collection of test cases is not a definition
> of a format. It is one of the pieces, and the other one is a textual
> description in a prominent, discoverable place.
>

A reference implementation can also serve the purpose of the textual
description, possibly well. But the point is that examples are not a spec.

Kenn


>
> Kenn
>
> On Thu, Apr 4, 2019 at 1:28 PM Lukasz Cwik <[email protected]> wrote:
>
>>
>>
>> On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath <[email protected]>
>> wrote:
>>
>>>
>>>
>>> On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik <[email protected]> wrote:
>>>
>>>> standard_coders.yaml[1] is where we are currently defining these
>>>> formats.
>>>> Unfortunately the Python SDK has its own copy[2].
>>>>
>>>
>>> Ah great. Thanks for the pointer. Any idea why there's  a separate copy
>>> for Python ? I didn't see a significant difference in definitions looking
>>> at few random coders there but I might have missed something. If there's no
>>> reason to maintain two, we should probably unify.
>>> Also, seems like we haven't added the definition for UTF-8 coder yet.
>>>
>>>
>> Not certain as well. I did notice the timer coder definition didn't exist
>> in the Python copy.
>>
>>
>>>
>>>> Here is an example PR[3] that adds the "beam:coder:double:v1" as tests
>>>> to the Java and Python SDKs to ensure interoperability.
>>>>
>>>> Robert Burke, does the Go SDK have a test where it uses
>>>> standard_coders.yaml and runs compatibility tests?
>>>>
>>>> Chamikara, creating new coder classes is a pain since the type -> coder
>>>> mapping per SDK language would select the non-well known type if we added a
>>>> new one to a language. If we swapped the default type->coder mapping, this
>>>> would still break update for pipelines forcing users to update their code
>>>> to select the non-well known type. If we don't change the default
>>>> type->coder mapping, the well known coder will gain little usage. I think
>>>> we should fix the Python coder to use the same encoding as Java for UTF-8
>>>> strings before there are too many Python SDK users.
>>>>
>>>
>>> I was thinking that may be we should just change the default UTF-8 coder
>>> for Fn API path which is experimental. Updating Python to do what's done
>>> for Java is fine if we agree that encoding used for Java should be the
>>> standard.
>>>
>>>
>> That is a good idea to use the Fn API experiment to control which gets
>> selected.
>>
>>
>>>
>>>> 1:
>>>> https://github.com/apache/beam/blob/master/model/fn-execution/src/main/resources/org/apache/beam/model/fnexecution/v1/standard_coders.yaml
>>>> 2:
>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/testing/data/standard_coders.yaml
>>>> 3: https://github.com/apache/beam/pull/8205
>>>>
>>>> On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath <
>>>> [email protected]> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> A URN defines the encoding.
>>>>>>
>>>>>> There are (unfortunately) *two* encodings defined for a Coder (defined
>>>>>> by a URN), the nested and the unnested one. IIRC, in both Java and
>>>>>> Python, the nested one prefixes with a var-int length, and the
>>>>>> unnested one does not.
>>>>>>
>>>>>
>>>>> Could you clarify where we define the exact encoding ? I only see a
>>>>> URN for UTF-8 [1] while if you look at the implementations Java includes
>>>>> length in the encoding [1] while Python [1] does not.
>>>>>
>>>>> [1]
>>>>> https://github.com/apache/beam/blob/069fc3de95bd96f34c363308ad9ba988ab58502d/model/pipeline/src/main/proto/beam_runner_api.proto#L563
>>>>> [2]
>>>>> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/coders/StringUtf8Coder.java#L50
>>>>> [3]
>>>>> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/coders/coders.py#L321
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> We should define the spec clearly and have cross-language tests.
>>>>>>
>>>>>
>>>>> +1
>>>>>
>>>>> Regarding backwards compatibility, I agree that we should probably not
>>>>> update existing coder classes. Probably we should just standardize the
>>>>> correct encoding (may be as a comment near corresponding URN in the
>>>>> beam_runner_api.proto ?) and create new coder classes as needed.
>>>>>
>>>>>
>>>>>>
>>>>>> On Thu, Apr 4, 2019 at 8:13 PM Pablo Estrada <[email protected]>
>>>>>> wrote:
>>>>>> >
>>>>>> > Could this be a backwards-incompatible change that would break
>>>>>> pipelines from upgrading? If they have data in-flight in between 
>>>>>> operators,
>>>>>> and we change the coder, they would break?
>>>>>> > I know very little about coders, but since nobody has mentioned it,
>>>>>> I wanted to make sure we have it in mind.
>>>>>> > -P.
>>>>>> >
>>>>>> > On Wed, Apr 3, 2019 at 8:33 PM Kenneth Knowles <[email protected]>
>>>>>> wrote:
>>>>>> >>
>>>>>> >> Agree that a coder URN defines the encoding. I see that string
>>>>>> UTF-8 was added to the proto enum, but it needs a written spec of the
>>>>>> encoding. Ideally some test data that different languages can use to 
>>>>>> drive
>>>>>> compliance testing.
>>>>>> >>
>>>>>> >> Kenn
>>>>>> >>
>>>>>> >> On Wed, Apr 3, 2019 at 6:21 PM Robert Burke <[email protected]>
>>>>>> wrote:
>>>>>> >>>
>>>>>> >>> String UTF8 was recently added as a "standard coder " URN in the
>>>>>> protos, but I don't think that developed beyond Java, so adding it to
>>>>>> Python would be reasonable in my opinion.
>>>>>> >>>
>>>>>> >>> The Go SDK handles Strings as "custom coders" presently which for
>>>>>> Go are always length prefixed (and reported to the Runner as
>>>>>> LP+CustomCoder). It would be straight forward to add the correct handling
>>>>>> for strings, as Go natively treats strings as UTF8.
>>>>>> >>>
>>>>>> >>>
>>>>>> >>> On Wed, Apr 3, 2019, 5:03 PM Heejong Lee <[email protected]>
>>>>>> wrote:
>>>>>> >>>>
>>>>>> >>>> Hi all,
>>>>>> >>>>
>>>>>> >>>> It looks like UTF-8 String Coder in Java and Python SDKs uses
>>>>>> different encoding schemes. StringUtf8Coder in Java SDK puts the varint
>>>>>> length of the input string before actual data bytes however StrUtf8Coder 
>>>>>> in
>>>>>> Python SDK directly encodes the input string to bytes value. For the last
>>>>>> few weeks, I've been testing and fixing cross-language IO transforms and
>>>>>> this discrepancy is a major blocker for me. IMO, we should unify the
>>>>>> encoding schemes of UTF8 strings across the different SDKs and make it a
>>>>>> standard coder. Any thoughts?
>>>>>> >>>>
>>>>>> >>>> Thanks,
>>>>>>
>>>>>

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

Reply via email to