Schemas in the Go SDK

2019-01-03 Thread Robert Burke
At this point I feel like the schema discussion should be a separate thread
from having a Coder Registry in Go, which was the original topic, so I'm
forking it.

It does sound like adding Schemas to the Go SDK would be a much larger
extension than the registry.

I'm not convinced that going without a convenient registry would serve Go
SDK users (such as they exist) well.

The concern I have isn't so much for Ints or Doubles as for user types
such as Protocol Buffers, though not only those. There will be some users
who prize efficiency first and readability second. The Go SDK presently
uses JSON encoding by default, which has many of the properties of schemas
but is severely limiting for power users.
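
To make that concrete, here's a rough, self-contained sketch of the kind of
encoder/decoder pair a power user might want a registry to bind to their
type, next to the JSON default. None of this is an existing SDK API;
UserEvent, encodeJSON, and encodeCompact are invented purely for
illustration.

package main

import (
    "bytes"
    "encoding/binary"
    "encoding/json"
    "fmt"
)

// UserEvent stands in for a user type that gets encoded on every shuffle.
type UserEvent struct {
    UserID int64
    Score  float64
}

// encodeJSON mirrors the current default behaviour: readable, but verbose.
func encodeJSON(e UserEvent) ([]byte, error) {
    return json.Marshal(e)
}

// encodeCompact is what an efficiency-first user might register instead:
// a fixed 16-byte little-endian layout.
func encodeCompact(e UserEvent) ([]byte, error) {
    var buf bytes.Buffer
    err := binary.Write(&buf, binary.LittleEndian, e)
    return buf.Bytes(), err
}

func decodeCompact(b []byte) (UserEvent, error) {
    var e UserEvent
    err := binary.Read(bytes.NewReader(b), binary.LittleEndian, &e)
    return e, err
}

func main() {
    e := UserEvent{UserID: 42, Score: 0.5}
    j, _ := encodeJSON(e)
    c, _ := encodeCompact(e)
    fmt.Printf("json: %d bytes, compact: %d bytes\n", len(j), len(c))
}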


It sounds like the following are true:
1. Full use of Schemas in the Go SDK will require FnAPI support.
* Until the FnAPI supports it, and the semantics are implemented in the
ULR, the Go SDK probably shouldn't begin to implement against it.
* This is identical to how Go's lack of SplittableDoFn keeps Beam Go
pipelines from scaling or from having Cross Language IO, which is also a
precursor to BeamGo using Beam SQL.
2. The main collision between Schemas and Coders is in the event that a
given type has both defined for it: which is being used, and when?
* This seems to me to be more about whether the syntactic sugar can be
enabled or not, and we know that at construction time, by the very use of
the sugar.
* If one wants to materialize a file encoded with the Schema, one would
need to encode that in the DoFn doing the writing somehow (eg. ForceSchema
or ForceCoder, whichever we want to make the default; see the sketch after
this list). This has pipeline compatibility implications.
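
A purely hypothetical sketch of the construction-time knob that
ForceSchema/ForceCoder implies; none of these names exist in any SDK, they
are placeholders for this discussion only.

package sketch

// Encoding says which representation a materializing DoFn should use.
type Encoding int

const (
    DefaultEncoding Encoding = iota // whatever schema/coder resolution picks
    ForceSchema                     // pin the schema (row) encoding
    ForceCoder                      // pin the registered custom coder
)

// WriteOption would be supplied to the writing transform at construction
// time, so the choice is recorded in the pipeline graph and is visible to
// pipeline-compatibility (update) checks.
type WriteOption struct {
    Encoding Encoding
}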

It's not presently possible for Go to annotate function parameters, but
something could be worked out, similarly to how SideInputs are configured
in the Go SDK. I'd be concerned about the efficiency of those operations
though, even with Generics or code generation.
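
To make the SideInput analogy concrete, here is roughly how side inputs are
configured in the Go SDK today, plus a hypothetical field-selection option
in the same style; beam.FieldAccess below is invented for illustration and
does not exist.

package example

import (
    "github.com/apache/beam/sdks/go/pkg/beam"
)

// Today: a side input is declared as a construction-time option, and the
// DoFn receives the extra value as an ordinary parameter.
func filterByLength(s beam.Scope, words, cutoff beam.PCollection) beam.PCollection {
    return beam.ParDo(s, func(word string, lengthCutoff float64, emit func(string)) {
        if float64(len(word)) >= lengthCutoff {
            emit(word)
        }
    }, words, beam.SideInput{Input: cutoff})
}

// Hypothetical: a schema-aware ParDo could take a similar option naming the
// fields to extract, since Go has no parameter annotations. beam.FieldAccess
// is invented here purely for illustration:
//
//   beam.ParDo(s, func(userID, countryCode string, emit func(string)) {
//       ...
//   }, events, beam.FieldAccess{Fields: []string{"user", "country"}})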


On Thu, 3 Jan 2019 at 16:33 Reuven Lax  wrote:

> On Fri, Jan 4, 2019 at 1:19 AM Robert Burke  wrote:
>
>> Very interesting Reuven!
>>
>> That would be a huge readability improvement, but it would also be a
>> significant investment over my time budget to implement them on the Go side
>> correctly. I would certainly want to read your documentation before going
>> ahead.  Will the Portability FnAPI have dedicated Schema support? That
>> would certainly change things.
>>
>
> Yes, there's absolutely a plan to add schema definitions to the FnAPI.
> This is what will allow you to use SQL from BeamGo.
>
>>
>> It's not clear to me how one might achieve the inversion from SchemaCoder
>> being a special case of CustomCoder to the other way around, since a
>> field has a type, and that type needs to be encoded. Short of always
>> encoding the primitive values in the way Beam prefers, it doesn't seem to
>> allow for customizing the encoding on output, or really say anything
>> outside of the (admittedly excellent) syntactic sugar demonstrated with the
>> Java API.
>>
>
> I'm not quite sure I understand. But schemas define a fixed set of
> primitive types, and also define the encodings for those primitive types.
> If a user wants custom encoding for a primitive type, they can create a
> byte-array field and wrap that field with a Coder (this is why I said that
> today's Coders are simply special cases); this should be very rare though,
> as users should rarely care how Beam encodes a long or a double.
>
>>
>> Offhand, Schemas seem to be an alternative approach to pipeline
>> construction, rather than to coders for value serialization, allowing
>> manual field extraction code to be omitted. They do not appear to be a
>> fundamental approach to achieving that serialization. For example, the
>> grouping operation still needs to encode the whole of the object as a
>> value.
>>
>
> Schemas are properties of the data - essentially a Schema is the data type
> of a PCollection. In Java Schemas are also understood by ParDo, so you can
> write a ParDo like this:
>
> @ProcessElement
> public void process(@Field("user") String userId,
>                     @Field("country") String countryCode) {
> }
>
> These extra functionalities are part of the graph, but they are enabled by
> schemas.
>
>>
>> As mentioned, I'm hoping to have a solution for existing coders by
>> January's end, so waiting for your documentation doesn't work on that
>> timeline.
>>
>
> I don't think we need to wait for all the documentation to be written.
>
>
>>
>> That said, they aren't incompatible ideas as demonstrated by the Java
>> implementation. The Go SDK remains in an experimental state. We can change
>> t