Re: [Go SDK] User Defined Coders

Reuven Lax Thu, 03 Jan 2019 16:33:19 -0800

On Fri, Jan 4, 2019 at 1:19 AM Robert Burke <[email protected]> wrote:


> Very interesting Reuven!
>
> That would be a huge readability improvement, but it would also be a
> significant investment over my time budget to implement them on the Go side
> correctly. I would certainly want to read your documentation before going
> ahead.  Will the Portability FnAPI have dedicated Schema support? That
> would certainly change things.
>

Yes, there's absolutely a plan to add schema definitions to the FnAPI. This
is what will allow you to use SQL from BeamGo

>
> It's not clear to me how one might achieve the inversion from SchemaCoder
> being a special casing of CustomCoder to the other way around, since a
> field has a type, and that type needs to be encoded. Short of always
> encoding the primitive values in the way Beam prefers, it doesn't seem to
> allow for customizing the encoding on output, or really say anything
> outside of the (admittedly excellent) syntactic sugar demonstrated with the
> Java API.
>

I'm not quite sure I understand. But schemas define a fixed set of
primitive types, and also define the encodings for those primitive types.
If a user wants custom encoding for a primitive type, they can create a
byte-array field and wrap that field with a Coder (this is why I said that
todays Coders are simply special cases); this should be very rare though,
as users rarely should care how Beam encodes a long or a double.

>
> Offhand, Schemas seem to be an alternative to pipeline construction,
> rather than coders for value serialization, allowing manual field
> extraction code to be omitted. They do not appear to be a fundamental
> approach to achieve it. For example, the grouping operation still needs to
> encode the whole of the object as a value.
>

Schemas are properties of the data - essentially a Schema is the data type
of a PCollection. In Java Schemas are also understood by ParDo, so you can
write a ParDo like this:

@ProcessElement
public void process(@Field("user") String userId,  @Field("country") String
countryCode) {
}

These extra functionalities are part of the graph, but they are enabled by
schemas.

>
> As mentioned, I'm hoping to have a solution for existing coders by
> January's end, so waiting for your documentation doesn't work on that
> timeline.
>

I don't think we need to wait for all the documentation to be written.


>
> That said, they aren't incompatible ideas as demonstrated by the Java
> implementation. The Go SDK remains in an experimental state. We can change
> things should the need arise in the next few months. Further, whenever 
> Generics
> in Go
> <https://go.googlesource.com/proposal/+/master/design/go2draft-generics-overview.md>
> crop up, the existing user surface and execution stack will need to be
> re-written to take advantage of them anyway. That provides an opportunity
> to invert Coder vs Schema dependence while getting a nice performance
> boost, and cleaner code (and deleting much of my code generator).
>
> ----
>
> Were I to implement schemas to get the same syntatic benefits as the Java
> API, I'd be leveraging the field annotations Go has. This satisfies the
> protocol buffer issue as well, since generated go protos have name & json
> annotations. Schemas could be extracted that way. These are also available
> to anything using static analysis for more direct generation of accessors.
> The reflective approach would also work, which is excellent for development
> purposes.
>
> The rote code that the schemas were replacing would be able to be cobbled
> together into efficient DoFn and CombineFns for serialization. At present,
> it seems like it could be implemented as a side package that uses beam,
> rather than changing portions of the core beam Go packages, The real trick
> would be to do so without "apply" since that's not how the Go SDK is shaped.
>
>
>
>
> On Thu, 3 Jan 2019 at 15:34 Gleb Kanterov <[email protected]> wrote:
>
>> Reuven, it sounds great. I see there is a similar thing to Row coders
>> happening in Apache Arrow <https://arrow.apache.org>, and there is a
>> similarity between Apache Arrow Flight
>> <https://www.slideshare.net/wesm/apache-arrow-at-dataengconf-barcelona-2018/23>
>> and data exchange service in portability. How do you see these two things
>> relate to each other in the long term?
>>
>> On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax <[email protected]> wrote:
>>
>>> The biggest advantage is actually readability and usability. A secondary
>>> advantage is that it means that Go will be able to interact seamlessly with
>>> BeamSQL, which would be a big win for Go.
>>>
>>> A schema is basically a way of saying that a record has a specific set
>>> of (possibly nested, possibly repeated) fields. So for instance let's say
>>> that the user's type is a struct with fields named user, country,
>>> purchaseCost. This allows us to provide transforms that operate on field
>>> names. Some example (using the Java API):
>>>
>>> PCollection users = events.apply(Select.fields("user"));  // Select out
>>> only the user field.
>>>
>>> PCollection joinedEvents =
>>> queries.apply(Join.innerJoin(clicks).byFields("user"));  // Join two
>>> PCollections by user.
>>>
>>> // For each country, calculate the total purchase cost as well as the
>>> top 10 purchases.
>>> // A new schema is created containing fields total_cost and
>>> top_purchases, and rows are created with the aggregation results.
>>> PCollection purchaseStatistics = events.apply(
>>>     Group.byFieldNames("country")
>>>                .aggregateField("purchaseCost", Sum.ofLongs(),
>>> "total_cost"))
>>>                 .aggregateField("purchaseCost", Top.largestLongs(10),
>>> "top_purchases"))
>>>
>>>
>>> This is far more readable than what we have today, and what unlocks this
>>> is that Beam actually knows the structure of the record instead of assuming
>>> records are uncrackable blobs.
>>>
>>> Note that a coder is basically a special case of a schema that has a
>>> single field.
>>>
>>> In BeamJava we have a SchemaRegistry which knows how to turn user types
>>> into schemas. We use reflection to analyze many user types (e.g. simple
>>> POJO structs, JavaBean classes, Avro records, protocol buffers, etc.) to
>>> determine the schema, however this is done only when the graph is initially
>>> generated. We do use code generation (in Java we do bytecode generation) to
>>> make this somewhat more efficient. I'm willing to bet that the code
>>> generator you've written for structs could be very easily modified for
>>> schemas instead, so it would not be wasted work if we went with schemas.
>>>
>>> One of the things I'm working on now is documenting Beam schemas. They
>>> are already very powerful and useful, but since there is still nothing in
>>> our documentation about them, they are not yet widely used. I expect to
>>> finish draft documentation by the end of January.
>>>
>>> Reuven
>>>
>>> On Thu, Jan 3, 2019 at 11:32 PM Robert Burke <[email protected]> wrote:
>>>
>>>> That's an interesting idea. I must confess I don't rightly know the
>>>> difference between a schema and coder, but here's what I've got with a bit
>>>> of searching through memory and the mailing list. Please let me know if I'm
>>>> off track.
>>>>
>>>> As near as I can tell, a schema, as far as Beam takes it
>>>> <https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java>
>>>>  is
>>>> a mechanism to define what data is extracted from a given row of data. So
>>>> in principle, there's an opportunity to be more efficient with data with
>>>> many columns that aren't being used, and only extract the data that's
>>>> meaningful to the pipeline.
>>>> The trick then is how to apply the schema to a given serialization
>>>> format, which is something I'm missing in my mental model (and then how to
>>>> do it efficiently in Go).
>>>>
>>>> I do know that the Go client package for BigQuery
>>>> <https://godoc.org/cloud.google.com/go/bigquery#hdr-Schemas> does
>>>> something like that, using field tags. Similarly, the "encoding/json"
>>>> <https://golang.org/doc/articles/json_and_go.html> package in the Go
>>>> Standard Library permits annotating fields and it will read out and
>>>> deserialize the JSON fields and that's it.
>>>>
>>>> A concern I have is that Go (at present) would require pre-compile time
>>>> code generation for schemas to be efficient, and they would still mostly
>>>> boil down to turning []bytes into real structs. Go reflection doesn't keep
>>>> up.
>>>> Go has no mechanism I'm aware of to Just In Time compile more efficient
>>>> processing of values.
>>>> It's also not 100% clear how Schema's would play with protocol buffers
>>>> or similar.
>>>> BigQuery has a mechanism of generating a JSON schema from a proto file
>>>> <https://github.com/GoogleCloudPlatform/protoc-gen-bq-schema>, but
>>>> that's only the specification half, not the using half.
>>>>
>>>> As it stands, the code generator I've been building these last months
>>>> could (in principle) statically analyze a user's struct, and then generate
>>>> an efficient dedicated coder for it. It just has no where to put them such
>>>> that the Go SDK would use it.
>>>>
>>>>
>>>> On Thu, Jan 3, 2019 at 1:39 PM Reuven Lax <[email protected]> wrote:
>>>>
>>>>> I'll make a different suggestion. There's been some chatter that
>>>>> schemas are a better tool than coders, and that in Beam 3.0 we should make
>>>>> schemas the basic semantics instead of coders. Schemas provide everything 
>>>>> a
>>>>> coder provides, but also allows for far more readable code. We can't make
>>>>> such a change in Beam Java 2.X for compatibility reasons, but maybe in Go
>>>>> we're better off starting with schemas instead of coders?
>>>>>
>>>>> Reuven
>>>>>
>>>>> On Thu, Jan 3, 2019 at 8:45 PM Robert Burke <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> One area that the Go SDK currently lacks: is the ability for users to
>>>>>> specify their own coders for types.
>>>>>>
>>>>>> I've written a proposal document,
>>>>>> <https://docs.google.com/document/d/1kQwx4Ah6PzG8z2ZMuNsNEXkGsLXm6gADOZaIO7reUOg/edit#>
>>>>>>  and
>>>>>> while I'm confident about the core, there are certainly some edge cases
>>>>>> that require discussion before getting on with the implementation.
>>>>>>
>>>>>> At presently, the SDK only permits primitive value types (all numeric
>>>>>> types but complex, strings, and []bytes) which are coded with beam 
>>>>>> coders,
>>>>>> and structs whose exported fields are of those type, which is then 
>>>>>> encoded
>>>>>> as JSON. Protocol buffer support is hacked in to avoid the type anaiyzer,
>>>>>> and presents the current work around this issue.
>>>>>>
>>>>>> The high level proposal is to catch up with Python and Java, and have
>>>>>> a coder registry. In addition, arrays, and maps should be permitted as 
>>>>>> well.
>>>>>>
>>>>>> If you have alternatives, or other suggestions and opinions, I'd love
>>>>>> to hear them! Otherwise my intent is to get a PR ready by the end of
>>>>>> January.
>>>>>>
>>>>>> Thanks!
>>>>>> Robert Burke
>>>>>>
>>>>>
>>>>
>>>> --
>>>> http://go/where-is-rebo
>>>>
>>>
>>
>> --
>> Cheers,
>> Gleb
>>
>

Re: [Go SDK] User Defined Coders

Reply via email to