On Fri, Jan 4, 2019 at 1:19 AM Robert Burke <rob...@frantil.com> wrote:
> Very interesting Reuven!
>
> That would be a huge readability improvement, but it would also be a significant investment beyond my time budget to implement them on the Go side correctly. I would certainly want to read your documentation before going ahead. Will the Portability FnAPI have dedicated Schema support? That would certainly change things.

Yes, there's absolutely a plan to add schema definitions to the FnAPI. This is what will allow you to use SQL from BeamGo.

> It's not clear to me how one might achieve the inversion from SchemaCoder being a special case of CustomCoder to the other way around, since a field has a type, and that type needs to be encoded. Short of always encoding the primitive values in the way Beam prefers, it doesn't seem to allow for customizing the encoding on output, or really say anything outside of the (admittedly excellent) syntactic sugar demonstrated with the Java API.

I'm not quite sure I understand, but schemas define a fixed set of primitive types, and also define the encodings for those primitive types. If a user wants a custom encoding for a primitive type, they can create a byte-array field and wrap that field with a Coder (this is why I said that today's Coders are simply special cases). This should be very rare, though, as users should rarely care how Beam encodes a long or a double.

> Offhand, Schemas seem to be an alternative form of pipeline construction, rather than a replacement for coders for value serialization, allowing manual field extraction code to be omitted. They do not appear to be a fundamental approach to achieving it. For example, the grouping operation still needs to encode the whole of the object as a value.

Schemas are properties of the data - essentially a Schema is the data type of a PCollection. In Java, Schemas are also understood by ParDo, so you can write a ParDo like this:

@ProcessElement
public void process(@Field("user") String userId, @Field("country") String countryCode) {
}

These extra functionalities are part of the graph, but they are enabled by schemas.

> As mentioned, I'm hoping to have a solution for existing coders by January's end, so waiting for your documentation doesn't work on that timeline.

I don't think we need to wait for all the documentation to be written.

> That said, they aren't incompatible ideas, as demonstrated by the Java implementation. The Go SDK remains in an experimental state. We can change things should the need arise in the next few months. Further, whenever Generics in Go <https://go.googlesource.com/proposal/+/master/design/go2draft-generics-overview.md> crop up, the existing user surface and execution stack will need to be rewritten to take advantage of them anyway. That provides an opportunity to invert the Coder vs. Schema dependence while getting a nice performance boost and cleaner code (and deleting much of my code generator).
>
> ----
>
> Were I to implement schemas to get the same syntactic benefits as the Java API, I'd be leveraging the field annotations Go has. This satisfies the protocol buffer issue as well, since generated Go protos have name & json annotations. Schemas could be extracted that way. These are also available to anything using static analysis for more direct generation of accessors. The reflective approach would also work, which is excellent for development purposes.
>
> The rote code that the schemas were replacing could be cobbled together into efficient DoFns and CombineFns for serialization.
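>
> Sketching it out, the reflective extraction could look something like this (the "beam" tag name, the Purchase struct, and the flattened output are placeholders for discussion, not a settled API):
>
> package main
>
> import (
> 	"fmt"
> 	"reflect"
> )
>
> // Purchase mimics a user struct; the tags mirror the name annotations
> // that generated Go protos and the BigQuery client already carry.
> type Purchase struct {
> 	User         string `beam:"user"`
> 	Country      string `beam:"country"`
> 	PurchaseCost int64  `beam:"purchaseCost"`
> }
>
> // inferSchema walks the struct's exported fields and collects
> // tag-name/type pairs - the raw material for a schema.
> func inferSchema(t reflect.Type) []string {
> 	fields := make([]string, 0, t.NumField())
> 	for i := 0; i < t.NumField(); i++ {
> 		f := t.Field(i)
> 		name := f.Tag.Get("beam")
> 		if name == "" {
> 			name = f.Name
> 		}
> 		fields = append(fields, fmt.Sprintf("%s %v", name, f.Type))
> 	}
> 	return fields
> }
>
> func main() {
> 	fmt.Println(inferSchema(reflect.TypeOf(Purchase{})))
> 	// [user string country string purchaseCost int64]
> }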
> At present, it seems like it could be implemented as a side package that uses Beam, rather than changing portions of the core Beam Go packages. The real trick would be to do so without "apply", since that's not how the Go SDK is shaped.
>
> On Thu, 3 Jan 2019 at 15:34 Gleb Kanterov <g...@spotify.com> wrote:
>> Reuven, it sounds great. I see there is a similar thing to Row coders happening in Apache Arrow <https://arrow.apache.org>, and there is a similarity between Apache Arrow Flight <https://www.slideshare.net/wesm/apache-arrow-at-dataengconf-barcelona-2018/23> and the data exchange service in portability. How do you see these two things relating to each other in the long term?
>>
>> On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax <re...@google.com> wrote:
>>> The biggest advantage is actually readability and usability. A secondary advantage is that it means that Go will be able to interact seamlessly with BeamSQL, which would be a big win for Go.
>>>
>>> A schema is basically a way of saying that a record has a specific set of (possibly nested, possibly repeated) fields. So for instance, let's say that the user's type is a struct with fields named user, country, and purchaseCost. This allows us to provide transforms that operate on field names. Some examples (using the Java API):
>>>
>>> PCollection users = events.apply(Select.fields("user")); // Select out only the user field.
>>>
>>> PCollection joinedEvents = queries.apply(Join.innerJoin(clicks).byFields("user")); // Join two PCollections by user.
>>>
>>> // For each country, calculate the total purchase cost as well as the top 10 purchases.
>>> // A new schema is created containing fields total_cost and top_purchases, and rows are created with the aggregation results.
>>> PCollection purchaseStatistics = events.apply(
>>>     Group.byFieldNames("country")
>>>         .aggregateField("purchaseCost", Sum.ofLongs(), "total_cost")
>>>         .aggregateField("purchaseCost", Top.largestLongs(10), "top_purchases"));
>>>
>>> This is far more readable than what we have today, and what unlocks it is that Beam actually knows the structure of the record instead of assuming records are uncrackable blobs.
>>>
>>> Note that a coder is basically a special case of a schema that has a single field.
>>>
>>> In BeamJava we have a SchemaRegistry which knows how to turn user types into schemas. We use reflection to analyze many user types (e.g. simple POJO structs, JavaBean classes, Avro records, protocol buffers, etc.) to determine the schema; however, this is done only when the graph is initially generated. We do use code generation (in Java we do bytecode generation) to make this somewhat more efficient. I'm willing to bet that the code generator you've written for structs could be very easily modified for schemas instead, so it would not be wasted work if we went with schemas.
>>>
>>> One of the things I'm working on now is documenting Beam schemas. They are already very powerful and useful, but since there is still nothing in our documentation about them, they are not yet widely used. I expect to finish draft documentation by the end of January.
>>>
>>> Reuven
>>>
>>> On Thu, Jan 3, 2019 at 11:32 PM Robert Burke <r...@google.com> wrote:
>>>> That's an interesting idea. I must confess I don't rightly know the difference between a schema and a coder, but here's what I've got from a bit of searching through memory and the mailing list. Please let me know if I'm off track.
>>>>
>>>> As near as I can tell, a schema, as far as Beam takes it <https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java>, is a mechanism to define what data is extracted from a given row of data. So in principle, there's an opportunity to be more efficient with data with many columns that aren't being used, and only extract the data that's meaningful to the pipeline. The trick then is how to apply the schema to a given serialization format, which is something I'm missing in my mental model (and then how to do it efficiently in Go).
>>>>
>>>> I do know that the Go client package for BigQuery <https://godoc.org/cloud.google.com/go/bigquery#hdr-Schemas> does something like that, using field tags. Similarly, the "encoding/json" <https://golang.org/doc/articles/json_and_go.html> package in the Go standard library permits annotating fields, and it will read out and deserialize just those JSON fields and nothing else.
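>>>>
>>>> For what it's worth, the encoding/json flavor of that idea is real, working Go today - a quick demonstration (the Event struct and input are made up for illustration):
>>>>
>>>> package main
>>>>
>>>> import (
>>>> 	"encoding/json"
>>>> 	"fmt"
>>>> )
>>>>
>>>> // Only the annotated fields are extracted; anything else in the
>>>> // input (e.g. purchaseCost below) is silently skipped.
>>>> type Event struct {
>>>> 	User    string `json:"user"`
>>>> 	Country string `json:"country"`
>>>> }
>>>>
>>>> func main() {
>>>> 	raw := []byte(`{"user":"u1","country":"NZ","purchaseCost":1200}`)
>>>> 	var e Event
>>>> 	if err := json.Unmarshal(raw, &e); err != nil {
>>>> 		panic(err)
>>>> 	}
>>>> 	fmt.Printf("%+v\n", e) // {User:u1 Country:NZ}
>>>> }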
Please let me know if I'm >>>> off track. >>>> >>>> As near as I can tell, a schema, as far as Beam takes it >>>> <https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/Schema.java> >>>> is >>>> a mechanism to define what data is extracted from a given row of data. So >>>> in principle, there's an opportunity to be more efficient with data with >>>> many columns that aren't being used, and only extract the data that's >>>> meaningful to the pipeline. >>>> The trick then is how to apply the schema to a given serialization >>>> format, which is something I'm missing in my mental model (and then how to >>>> do it efficiently in Go). >>>> >>>> I do know that the Go client package for BigQuery >>>> <https://godoc.org/cloud.google.com/go/bigquery#hdr-Schemas> does >>>> something like that, using field tags. Similarly, the "encoding/json" >>>> <https://golang.org/doc/articles/json_and_go.html> package in the Go >>>> Standard Library permits annotating fields and it will read out and >>>> deserialize the JSON fields and that's it. >>>> >>>> A concern I have is that Go (at present) would require pre-compile time >>>> code generation for schemas to be efficient, and they would still mostly >>>> boil down to turning []bytes into real structs. Go reflection doesn't keep >>>> up. >>>> Go has no mechanism I'm aware of to Just In Time compile more efficient >>>> processing of values. >>>> It's also not 100% clear how Schema's would play with protocol buffers >>>> or similar. >>>> BigQuery has a mechanism of generating a JSON schema from a proto file >>>> <https://github.com/GoogleCloudPlatform/protoc-gen-bq-schema>, but >>>> that's only the specification half, not the using half. >>>> >>>> As it stands, the code generator I've been building these last months >>>> could (in principle) statically analyze a user's struct, and then generate >>>> an efficient dedicated coder for it. It just has no where to put them such >>>> that the Go SDK would use it. >>>> >>>> >>>> On Thu, Jan 3, 2019 at 1:39 PM Reuven Lax <re...@google.com> wrote: >>>> >>>>> I'll make a different suggestion. There's been some chatter that >>>>> schemas are a better tool than coders, and that in Beam 3.0 we should make >>>>> schemas the basic semantics instead of coders. Schemas provide everything >>>>> a >>>>> coder provides, but also allows for far more readable code. We can't make >>>>> such a change in Beam Java 2.X for compatibility reasons, but maybe in Go >>>>> we're better off starting with schemas instead of coders? >>>>> >>>>> Reuven >>>>> >>>>> On Thu, Jan 3, 2019 at 8:45 PM Robert Burke <rob...@frantil.com> >>>>> wrote: >>>>> >>>>>> One area that the Go SDK currently lacks: is the ability for users to >>>>>> specify their own coders for types. >>>>>> >>>>>> I've written a proposal document, >>>>>> <https://docs.google.com/document/d/1kQwx4Ah6PzG8z2ZMuNsNEXkGsLXm6gADOZaIO7reUOg/edit#> >>>>>> and >>>>>> while I'm confident about the core, there are certainly some edge cases >>>>>> that require discussion before getting on with the implementation. >>>>>> >>>>>> At presently, the SDK only permits primitive value types (all numeric >>>>>> types but complex, strings, and []bytes) which are coded with beam >>>>>> coders, >>>>>> and structs whose exported fields are of those type, which is then >>>>>> encoded >>>>>> as JSON. Protocol buffer support is hacked in to avoid the type anaiyzer, >>>>>> and presents the current work around this issue. 
>>>>>>
>>>>>> If you have alternatives, or other suggestions and opinions, I'd love to hear them! Otherwise my intent is to get a PR ready by the end of January.
>>>>>>
>>>>>> Thanks!
>>>>>> Robert Burke
>>>>
>>>> --
>>>> http://go/where-is-rebo
>>
>> --
>> Cheers,
>> Gleb