Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-16 Thread Thomas Weise
I opened https://github.com/apache/beam/pull/8319 to eliminate the duplicate yaml file (and cover timestamp coder for the Python SDK). Would appreciate if someone could take a look. (PR doesn't affect the StrUtf8Coder subject, but it is required to fix a timer bug.) Thanks, Thomas On Fri, Apr 12

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-12 Thread Lukasz Cwik
This is a minor point, Robert Burke, but having access to the "stream" when decoding/encoding could mean that you're reading/writing from the underlying transport channel directly and don't need to copy the bytes into/from memory. On Wed, Apr 10, 2019 at 3:45 PM Kenneth Knowles wrote: > On Mon, A

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-10 Thread Kenneth Knowles
On Mon, Apr 8, 2019 at 4:03 PM Robert Bradshaw wrote: > This email is already very long, but in summary I think the right > answer is to just get rid of Outer altogether (except possibly for > IOs, which we'd only preserve for legacy reasons until 3.0). > > - Robert > I had forgotten that compat

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-08 Thread Robert Burke
On Mon, 8 Apr 2019 at 16:03, Robert Bradshaw wrote: > On Mon, Apr 8, 2019 at 8:04 PM Kenneth Knowles wrote: > > > > On Mon, Apr 8, 2019 at 1:57 AM Robert Bradshaw > wrote: > >> > >> On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles > wrote: > >> > > >> > On Fri, Apr 5, 2019 at 2:24 PM Robert Bra

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-08 Thread Robert Bradshaw
On Mon, Apr 8, 2019 at 8:04 PM Kenneth Knowles wrote: > > On Mon, Apr 8, 2019 at 1:57 AM Robert Bradshaw wrote: >> >> On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles wrote: >> > >> > On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw wrote: >> >> >> >> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-08 Thread Kenneth Knowles
On Mon, Apr 8, 2019 at 1:57 AM Robert Bradshaw wrote: > On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles wrote: > > > > > > > > On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw > wrote: > >> > >> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles wrote: > >> > > >> > Nested and unnested contexts are

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-08 Thread Robert Bradshaw
On Sat, Apr 6, 2019 at 12:08 AM Kenneth Knowles wrote: > > > > On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw wrote: >> >> On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles wrote: >> > >> > Nested and unnested contexts are two different encodings. Can we just give >> > them different URNs? We can

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-05 Thread Kenneth Knowles
On Fri, Apr 5, 2019 at 2:24 PM Robert Bradshaw wrote: > On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles wrote: > > > > Nested and unnested contexts are two different encodings. Can we just > give them different URNs? We can even just express the length-prefixed > UTF-8 as a composition of the len

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-05 Thread Robert Bradshaw
On Fri, Apr 5, 2019 at 6:24 PM Kenneth Knowles wrote: > > Nested and unnested contexts are two different encodings. Can we just give > them different URNs? We can even just express the length-prefixed UTF-8 as a > composition of the length-prefix URN and the UTF-8 URN. It's not that simple, esp

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-05 Thread Lukasz Cwik
Also, as for the backwards compatibility discussion, I don't believe non-portable jobs will be able to be upgraded to portable jobs, so that transition may be a good time to make upgrade-incompatible coder changes. On Fri, Apr 5, 2019 at 1:44 PM Lukasz Cwik wrote: > Robert, I filed h

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-05 Thread Lukasz Cwik
Robert, I filed https://issues.apache.org/jira/browse/BEAM-7015 for removing the Python SDK copy of standard_coders.yaml and assigned it to you. On Fri, Apr 5, 2019 at 9:24 AM Kenneth Knowles wrote: > Nested and unnested contexts are two different encodings. Can we just give > them different URN

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-05 Thread Kenneth Knowles
Nested and unnested contexts are two different encodings. Can we just give them different URNs? We can even just express the length-prefixed UTF-8 as a composition of the length-prefix URN and the UTF-8 URN. On Fri, Apr 5, 2019 at 12:38 AM Robert Bradshaw wrote: > On Fri, Apr 5, 2019 at 12:50 AM

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-05 Thread Robert Bradshaw
On Fri, Apr 5, 2019 at 12:50 AM Heejong Lee wrote: > > Robert, does nested/unnested context work properly for Java? I believe so. It is similar to the bytes coder, that prefixes vs. not based on the context. > I can see that the Context is fixed to NESTED[1] and the encode method with > the Con

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Heejong Lee
Robert, does nested/unnested context work properly for Java? I can see that the Context is fixed to NESTED[1] and the encode method with the Context parameter is marked as deprecated[2]. [1]: https://github.com/apache/beam/blob/0868e7544fd1e96db67ff5b9e70a67802c0f0c8e/sdks/java/core/src/main/java/

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Robert Bradshaw
I don't know why there are two separate copies of standard_coders.yaml--originally there was just one (though it did live in the Python directory). I'm guessing a copy was made rather than just pointing both to the new location, but that completely defeats the point. I can't seem to access JIRA rig

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Robert Burke
My 2 cents is that the "Textual description" should be part of the documentation of the URNs on the Proto messages, since that's the common place. I've added a short description for the varints for example, and we already have lengthier format & protocol descriptions there for iterables and similar

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Kenneth Knowles
On Thu, Apr 4, 2019 at 1:49 PM Robert Burke wrote: > We should probably move the "java" version of the yaml file [1] to a > common location rather than deep in the java hierarchy, or copying it for > Go and Python, but that can be a separate task. It's probably non-trivial > since it looks like i

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Kenneth Knowles
On Thu, Apr 4, 2019 at 1:48 PM Kenneth Knowles wrote: > I have to actually say that a collection of test cases is not a definition > of a format. It is one of the pieces, and the other one is a textual > description in a prominent, discoverable place. > A reference implementation can also serve

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Robert Burke
We should probably move the "java" version of the yaml file [1] to a common location rather than deep in the java hierarchy, or copying it for Go and Python, but that can be a separate task. It's probably non-trivial since it looks like it's part of a java resources structure. Luke, the Go SDK doe

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Kenneth Knowles
I have to actually say that a collection of test cases is not a definition of a format. It is one of the pieces, and the other one is a textual description in a prominent, discoverable place. Kenn On Thu, Apr 4, 2019 at 1:28 PM Lukasz Cwik wrote: > > > On Thu, Apr 4, 2019 at 1:15 PM Chamikara J

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Lukasz Cwik
On Thu, Apr 4, 2019 at 1:15 PM Chamikara Jayalath wrote: > > > On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik wrote: > >> standard_coders.yaml[1] is where we are currently defining these formats. >> Unfortunately the Python SDK has its own copy[2]. >> > > Ah great. Thanks for the pointer. Any idea

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Chamikara Jayalath
On Thu, Apr 4, 2019 at 12:15 PM Lukasz Cwik wrote: > standard_coders.yaml[1] is where we are currently defining these formats. > Unfortunately the Python SDK has its own copy[2]. > Ah great. Thanks for the pointer. Any idea why there's a separate copy for Python? I didn't see a significant dif

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Lukasz Cwik
standard_coders.yaml[1] is where we are currently defining these formats. Unfortunately the Python SDK has its own copy[2]. Here is an example PR[3] that adds the "beam:coder:double:v1" as tests to the Java and Python SDKs to ensure interoperability. Robert Burke, does the Go SDK have a test wher

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Heejong Lee
On Thu, Apr 4, 2019 at 11:50 AM Chamikara Jayalath wrote: > > > On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw > wrote: > >> A URN defines the encoding. >> >> There are (unfortunately) *two* encodings defined for a Coder (defined >> by a URN), the nested and the unnested one. IIRC, in both Java

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Chamikara Jayalath
On Thu, Apr 4, 2019 at 11:29 AM Robert Bradshaw wrote: > A URN defines the encoding. > > There are (unfortunately) *two* encodings defined for a Coder (defined > by a URN), the nested and the unnested one. IIRC, in both Java and > Python, the nested one prefixes with a var-int length, and the > u

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Robert Bradshaw
A URN defines the encoding. There are (unfortunately) *two* encodings defined for a Coder (defined by a URN), the nested and the unnested one. IIRC, in both Java and Python, the nested one prefixes with a var-int length, and the unnested one does not. We should define the spec clearly and have cr
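
As a rough illustration of the two encodings described in this message, the sketch below encodes the same string with and without a var-int length prefix. The function names are hypothetical and this is not the Beam SDK implementation; the var-int is assumed to be the usual base-128 format, and the length is assumed to count UTF-8 bytes rather than characters.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint, least-significant 7-bit group first (assumed format)."""
    if n < 0:
        raise ValueError("length must be non-negative")
    out = bytearray()
    while True:
        bits = n & 0x7F
        n >>= 7
        if n:
            out.append(bits | 0x80)  # continuation bit set: more groups follow
        else:
            out.append(bits)
            return bytes(out)


def decode_varint(data: bytes, pos: int = 0):
    """Return (value, next position) for a varint starting at data[pos]."""
    result, shift = 0, 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_utf8(value: str, nested: bool) -> bytes:
    """Nested: varint byte length + UTF-8 bytes. Unnested: UTF-8 bytes only."""
    payload = value.encode("utf-8")
    if nested:
        return encode_varint(len(payload)) + payload
    return payload


def decode_utf8(data: bytes, nested: bool) -> str:
    """Inverse of encode_utf8 for a buffer holding exactly one value."""
    if nested:
        length, pos = decode_varint(data)
        return data[pos:pos + length].decode("utf-8")
    return data.decode("utf-8")
```

For example, `encode_utf8("abc", nested=False)` yields `b"abc"`, while `encode_utf8("abc", nested=True)` yields `b"\x03abc"`. The unnested form relies on the surrounding context to know where the value ends.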

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-04 Thread Pablo Estrada
Could this be a backwards-incompatible change that would break pipelines from upgrading? If they have data in-flight in between operators, and we change the coder, they would break? I know very little about coders, but since nobody has mentioned it, I wanted to make sure we have it in mind. -P. On

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-03 Thread Kenneth Knowles
Agree that a coder URN defines the encoding. I see that string UTF-8 was added to the proto enum, but it needs a written spec of the encoding. Ideally some test data that different languages can use to drive compliance testing. Kenn On Wed, Apr 3, 2019 at 6:21 PM Robert Burke wrote: > String UT
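
One way to picture the shared test data suggested here is a small table of values and their expected bytes that every SDK checks its coder against. The sketch below is hypothetical: the table layout and the `check_coder` helper are illustrations only, not the format of standard_coders.yaml, and the expected bytes assume the unprefixed UTF-8 encoding.

```python
# Hypothetical compliance table: (decoded value, expected encoded bytes).
TEST_CASES = [
    ("", b""),
    ("abc", b"abc"),
    ("\u00e9", b"\xc3\xa9"),                          # 2-byte code point
    ("\u65e5\u672c", b"\xe6\x97\xa5\xe6\x9c\xac"),    # 3-byte code points
]


def check_coder(encode, decode, cases=TEST_CASES):
    """Return a list of human-readable mismatches; empty means compliant."""
    failures = []
    for value, expected in cases:
        actual = encode(value)
        if actual != expected:
            failures.append(f"encode({value!r}): {actual!r} != {expected!r}")
        decoded = decode(expected)
        if decoded != value:
            failures.append(f"decode({expected!r}): {decoded!r} != {value!r}")
    return failures


# Example run against Python's built-in UTF-8 codec.
failures = check_coder(lambda s: s.encode("utf-8"), lambda b: b.decode("utf-8"))
```

Because the table is plain data, the same cases could be stored in a language-neutral file and loaded by each SDK's test suite.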

Re: [DISCUSS] change the encoding scheme of Python StrUtf8Coder

2019-04-03 Thread Robert Burke
String UTF8 was recently added as a "standard coder" URN in the protos, but I don't think that developed beyond Java, so adding it to Python would be reasonable in my opinion. The Go SDK handles Strings as "custom coders" presently which for Go are always length prefixed (and reported to the Runn