Schemas in the Go SDK

Robert Burke Thu, 03 Jan 2019 17:52:37 -0800

At this point I feel like the schema discussion should be a separate thread
from having a Coder Registry in Go, which was the original topic, so I'm
forking it.


It does sound like adding Schemas to the Go SDK would be a much larger
extension than the registry.

I'm not convinced that not having a convenient registry would serve Go SDK
users (such as they exist).

The concern I have isn't so much for Ints or Doubles as for user types,
such as Protocol Buffers but not just those. There will be some users who
prize efficiency first and readability second. The Go SDK presently uses
JSON encoding by default, which has many of the properties of schemas but
is severely limiting for power users.


It sounds like the following are true:
1. Full use of the Schemas in the Go SDK will require FnAPI support.
* Until the FnAPI supports it, and the semantics are implemented in the
ULR, the Go SDK probably shouldn't begin to implement against it.
* This is identical to Go's lack of SplittableDoFn keeping Beam Go pipelines
from scaling or from having Cross Language IO, which is also a precursor to
BeamGo using Beam SQL.
2. The main collision between Schemas and Coders is in the event that a
given type has both defined for it: Which is being used and when?
* This seems to me more to do with being able to enable use of the
syntactic sugar or not, but we know that at construction time, by the very
use of the sugar.
* If a pipeline wants to materialize a file encoded with the Schema, one would
need to encode that in the DoFn doing the writing somehow (e.g. ForceSchema
or ForceCoder, whichever we want to make the default). This has pipeline
compatibility implications.

It's not presently possible for Go to annotate function parameters, but
something could be worked out, similarly to how SideInputs are configured
in the Go SDK. I'd be concerned about the efficiency of those operations
though, even with Generics or code generation.


On Thu, 3 Jan 2019 at 16:33 Reuven Lax <[email protected]> wrote:

> On Fri, Jan 4, 2019 at 1:19 AM Robert Burke <[email protected]> wrote:
>
>> Very interesting Reuven!
>>
>> That would be a huge readability improvement, but it would also be a
>> significant investment over my time budget to implement them on the Go side
>> correctly. I would certainly want to read your documentation before going
>> ahead.  Will the Portability FnAPI have dedicated Schema support? That
>> would certainly change things.
>>
>
> Yes, there's absolutely a plan to add schema definitions to the FnAPI.
> This is what will allow you to use SQL from BeamGo.
>
>>
>> It's not clear to me how one might achieve the inversion from SchemaCoder
>> being a special casing of CustomCoder to the other way around, since a
>> field has a type, and that type needs to be encoded. Short of always
>> encoding the primitive values in the way Beam prefers, it doesn't seem to
>> allow for customizing the encoding on output, or really say anything
>> outside of the (admittedly excellent) syntactic sugar demonstrated with the
>> Java API.
>>
>
> I'm not quite sure I understand. But schemas define a fixed set of
> primitive types, and also define the encodings for those primitive types.
> If a user wants custom encoding for a primitive type, they can create a
> byte-array field and wrap that field with a Coder (this is why I said that
> today's Coders are simply special cases); this should be very rare though,
> as users rarely should care how Beam encodes a long or a double.
>
>>
>> Offhand, Schemas seem to be an alternative to pipeline construction,
>> rather than coders for value serialization, allowing manual field
>> extraction code to be omitted. They do not appear to be a fundamental
>> approach to achieve it. For example, the grouping operation still needs to
>> encode the whole of the object as a value.
>>
>
> Schemas are properties of the data - essentially a Schema is the data type
> of a PCollection. In Java Schemas are also understood by ParDo, so you can
> write a ParDo like this:
>
> @ProcessElement
> public void process(@Field("user") String userId,  @Field("country")
> String countryCode) {
> }
>
> These extra functionalities are part of the graph, but they are enabled by
> schemas.
>
>>
>> As mentioned, I'm hoping to have a solution for existing coders by
>> January's end, so waiting for your documentation doesn't work on that
>> timeline.
>>
>
> I don't think we need to wait for all the documentation to be written.
>
>
>>
>> That said, they aren't incompatible ideas as demonstrated by the Java
>> implementation. The Go SDK remains in an experimental state. We can change
>> things should the need arise in the next few months. Further, whenever
>> Generics in Go
>> <https://go.googlesource.com/proposal/+/master/design/go2draft-generics-overview.md>
>> crop up, the existing user surface and execution stack will need to be
>> re-written to take advantage of them anyway. That provides an opportunity
>> to invert Coder vs Schema dependence while getting a nice performance
>> boost, and cleaner code (and deleting much of my code generator).
>>
>> ----
>>
>> Were I to implement schemas to get the same syntactic benefits as the Java
>> API, I'd be leveraging the field annotations Go has. This satisfies the
>> protocol buffer issue as well, since generated go protos have name & json
>> annotations. Schemas could be extracted that way. These are also available
>> to anything using static analysis for more direct generation of accessors.
>> The reflective approach would also work, which is excellent for development
>> purposes.
>>
>> The rote code that the schemas were replacing would be able to be cobbled
>> together into efficient DoFn and CombineFns for serialization. At present,
>> it seems like it could be implemented as a side package that uses beam,
>> rather than changing portions of the core beam Go packages. The real trick
>> would be to do so without "apply" since that's not how the Go SDK is shaped.
>>
>>
>>
>>
>> On Thu, 3 Jan 2019 at 15:34 Gleb Kanterov <[email protected]> wrote:
>>
>>> Reuven, it sounds great. I see there is a similar thing to Row coders
>>> happening in Apache Arrow <https://arrow.apache.org>, and there is a
>>> similarity between Apache Arrow Flight
>>> <https://www.slideshare.net/wesm/apache-arrow-at-dataengconf-barcelona-2018/23>
>>> and data exchange service in portability. How do you see these two things
>>> relate to each other in the long term?
>>>
>>> On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax <[email protected]> wrote:
>>>
>>>> The biggest advantage is actually readability and usability. A
>>>> secondary advantage is that it means that Go will be able to interact
>>>> seamlessly with BeamSQL, which would be a big win for Go.
>>>>
>>>> A schema is basically a way of saying that a record has a specific set
>>>> of (possibly nested, possibly repeated) fields. So for instance let's say
>>>> that the user's type is a struct with fields named user, country,
>>>> purchaseCost. This allows us to provide transforms that operate on field
>>>> names. Some example (using the Java API):
>>>>
>>>> // Select out only the user field.
>>>> PCollection users = events.apply(Select.fields("user"));
>>>>
>>>> // Join two PCollections by user.
>>>> PCollection joinedEvents =
>>>>     queries.apply(Join.innerJoin(clicks).byFields("user"));
>>>>
>>>> // For each country, calculate the total purchase cost as well as the
>>>> // top 10 purchases.
>>>> // A new schema is created containing fields total_cost and
>>>> // top_purchases, and rows are created with the aggregation results.
>>>> PCollection purchaseStatistics = events.apply(
>>>>     Group.byFieldNames("country")
>>>>         .aggregateField("purchaseCost", Sum.ofLongs(), "total_cost")
>>>>         .aggregateField("purchaseCost", Top.largestLongs(10),
>>>>             "top_purchases"));
>>>>
>>>>
>>>> This is far more readable than what we have today, and what unlocks
>>>> this is that Beam actually knows the structure of the record instead of
>>>> assuming records are uncrackable blobs.
>>>>
>>>> Note that a coder is basically a special case of a schema that has a
>>>> single field.
>>>>
>>>> In BeamJava we have a SchemaRegistry which knows how to turn user types
>>>> into schemas. We use reflection to analyze many user types (e.g. simple
>>>> POJO structs, JavaBean classes, Avro records, protocol buffers, etc.) to
>>>> determine the schema, however this is done only when the graph is initially
>>>> generated. We do use code generation (in Java we do bytecode generation) to
>>>> make this somewhat more efficient. I'm willing to bet that the code
>>>> generator you've written for structs could be very easily modified for
>>>> schemas instead, so it would not be wasted work if we went with schemas.
>>>>
>>>> One of the things I'm working on now is documenting Beam schemas. They
>>>> are already very powerful and useful, but since there is still nothing in
>>>> our documentation about them, they are not yet widely used. I expect to
>>>> finish draft documentation by the end of January.
>>>>
>>>> Reuven
>>>>
>>>> On Thu, Jan 3, 2019 at 11:32 PM Robert Burke <[email protected]> wrote:
>>>>
>>>>> That's an interesting idea. I must confess I don't rightly know the
>>>>> difference between a schema and coder, but here's what I've got with a bit
>>>>> of searching through memory and the mailing list. Please let me know if
>>>>> I'm off track.
>>>>>
>>>>> As near as I can tell, a schema, as far as Beam takes it
>>>>> <https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java>
>>>>> is a mechanism to define what data is extracted from a given row of data. So
>>>>> in principle, there's an opportunity to be more efficient with data with
>>>>> many columns that aren't being used, and only extract the data that's
>>>>> meaningful to the pipeline.
>>>>> The trick then is how to apply the schema to a given serialization
>>>>> format, which is something I'm missing in my mental model (and then how to
>>>>> do it efficiently in Go).
>>>>>
>>>>> I do know that the Go client package for BigQuery
>>>>> <https://godoc.org/cloud.google.com/go/bigquery#hdr-Schemas> does
>>>>> something like that, using field tags. Similarly, the "encoding/json"
>>>>> <https://golang.org/doc/articles/json_and_go.html> package in the Go
>>>>> Standard Library permits annotating fields and it will read out and
>>>>> deserialize the JSON fields and that's it.
>>>>>
>>>>> A concern I have is that Go (at present) would require pre-compile
>>>>> time code generation for schemas to be efficient, and they would still
>>>>> mostly boil down to turning []bytes into real structs. Go reflection
>>>>> doesn't keep up.
>>>>> Go has no mechanism I'm aware of to Just In Time compile more
>>>>> efficient processing of values.
>>>>> It's also not 100% clear how Schemas would play with protocol buffers
>>>>> or similar.
>>>>> BigQuery has a mechanism of generating a JSON schema from a proto file
>>>>> <https://github.com/GoogleCloudPlatform/protoc-gen-bq-schema>, but
>>>>> that's only the specification half, not the using half.
>>>>>
>>>>> As it stands, the code generator I've been building these last months
>>>>> could (in principle) statically analyze a user's struct, and then generate
>>>>> an efficient dedicated coder for it. It just has nowhere to put them such
>>>>> that the Go SDK would use it.
>>>>>
>>>>>
>>>>> On Thu, Jan 3, 2019 at 1:39 PM Reuven Lax <[email protected]> wrote:
>>>>>
>>>>>> I'll make a different suggestion. There's been some chatter that
>>>>>> schemas are a better tool than coders, and that in Beam 3.0 we should
>>>>>> make schemas the basic semantics instead of coders. Schemas provide
>>>>>> everything a coder provides, but also allow for far more readable code.
>>>>>> We can't make such a change in Beam Java 2.X for compatibility reasons,
>>>>>> but maybe in Go we're better off starting with schemas instead of coders?
>>>>>>
>>>>>> Reuven
>>>>>>
>>>>>> On Thu, Jan 3, 2019 at 8:45 PM Robert Burke <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> One area that the Go SDK currently lacks is the ability for users
>>>>>>> to specify their own coders for types.
>>>>>>>
>>>>>>> I've written a proposal document
>>>>>>> <https://docs.google.com/document/d/1kQwx4Ah6PzG8z2ZMuNsNEXkGsLXm6gADOZaIO7reUOg/edit#>,
>>>>>>> and while I'm confident about the core, there are certainly some edge
>>>>>>> cases that require discussion before getting on with the implementation.
>>>>>>>
>>>>>>> At present, the SDK only permits primitive value types (all
>>>>>>> numeric types but complex, strings, and []bytes), which are coded with
>>>>>>> beam coders, and structs whose exported fields are of those types, which
>>>>>>> are then encoded as JSON. Protocol buffer support is hacked in to avoid
>>>>>>> the type analyzer, and is the current workaround for this issue.
>>>>>>>
>>>>>>> The high level proposal is to catch up with Python and Java, and
>>>>>>> have a coder registry. In addition, arrays and maps should be
>>>>>>> permitted as well.
>>>>>>>
>>>>>>> If you have alternatives, or other suggestions and opinions, I'd
>>>>>>> love to hear them! Otherwise my intent is to get a PR ready by the end
>>>>>>> of January.
>>>>>>>
>>>>>>> Thanks!
>>>>>>> Robert Burke
>>>>>>>
>>>>>>
>>>>>
>>>>> --
>>>>> http://go/where-is-rebo
>>>>>
>>>>
>>>
>>> --
>>> Cheers,
>>> Gleb
>>>
>>
