I've updated the design doc
<https://docs.google.com/document/d/1kQwx4Ah6PzG8z2ZMuNsNEXkGsLXm6gADOZaIO7reUOg/edit#>
with a section on schemas. Interestingly, the lack of generics in Go ends up
being very handy: there's no incompatibility between converting from a
concrete type and its Schema equivalent.
The main question around "inferring" a coder comes from using schemas by
default, but also from forcing a PCollection to its Schema equivalent, which
would then be used.
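
To make that concrete, here's a rough sketch; the Schema, Field, and
FieldType types below are purely illustrative stand-ins, not the
definitions from the doc:

package schemas

// Illustrative schema types; not the doc's final definition.
type FieldType int

const (
	StringType FieldType = iota
	Int64Type
)

type Field struct {
	Name string
	Type FieldType
}

type Schema struct {
	Fields []Field
}

// A user's concrete type.
type Purchase struct {
	User         string
	Country      string
	PurchaseCost int64
}

// The Schema equivalent of Purchase. Without generics there's no
// parameterized container type for the two representations to
// disagree over, so converting between them stays unambiguous.
var purchaseSchema = Schema{Fields: []Field{
	{Name: "User", Type: StringType},
	{Name: "Country", Type: StringType},
	{Name: "PurchaseCost", Type: Int64Type},
}}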

There are then other difficulties around the actual implementation, but at
this point I don't see user code interacting *directly* with the recursive
schema definition; instead it would rely on implied conversions to concrete
types, which are easier for the user to manipulate.

To get performance, each type could have specific T -> Schema and Schema -> T
conversions generated for it. There are some semantics to work out there, but
they don't touch coders, so I'm not going deep into them now. In particular,
these conversions would be necessary for converting a PCollection<Schema> to a
PCollection<Concrete> for use in DoFns.
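
For instance, the generated pair for the Purchase sketch above might look
like the following, with Row as an illustrative stand-in for the
schema-encoded form:

// Row is an illustrative stand-in for a schema-encoded value.
type Row map[string]interface{}

// Hypothetical generated conversions: straight field copies, so no
// reflection is needed at execution time.
func purchaseToRow(p Purchase) Row {
	return Row{
		"User":         p.User,
		"Country":      p.Country,
		"PurchaseCost": p.PurchaseCost,
	}
}

func rowToPurchase(r Row) Purchase {
	return Purchase{
		User:         r["User"].(string),
		Country:      r["Country"].(string),
		PurchaseCost: r["PurchaseCost"].(int64),
	}
}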

The Schema type is largely useful for the framework to manipulate types, and
in those cases using the schema coder is the obvious choice. For everything
else, it wouldn't be too bad to provide a ConvertToSchemaType transform from
PCollection to PCollection, which would effectively switch the PCollection to
schema coding when it's being sent to a cross-language sink that requires
schema-coded values. Cross-language sources would already require explicitly
pointing out the type to use, and for that a user could provide an explicit
Schema value (in whatever form we implement it) to force correct decoding.
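
Usage might look something like the following sketch, where
ConvertToSchemaType is the proposed transform and crosslang.Write is a
made-up stand-in for a cross-language sink requiring schema-coded values
(beam.Scope and beam.PCollection are today's Go SDK types):

// Hypothetical: purchases is an ordinary PCollection of Purchase
// values; schemaCol is the same data forced to use schema coding.
func writeViaSchema(s beam.Scope, purchases beam.PCollection) {
	schemaCol := ConvertToSchemaType(s, purchases)
	crosslang.Write(s, schemaCol)
}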

Overall, given that schemas *are not yet* in the FnAPI, and that I can't
currently find any insurmountable issues between the proposal and schemas,
I'm going to start a PR for this.
Cheers,
Robert Burke

PS. I've added links to this doc and the other Go-specific ones to the
Technical/Design Doc page of the wiki
<https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=95653903>
to make them easier to find.


On Wed, 9 Jan 2019 at 07:40, Robert Bradshaw <rober...@google.com> wrote:

> On Tue, Jan 8, 2019 at 9:15 PM Reuven Lax <re...@google.com> wrote:
> >
> > I wonder if we could do this _only_ over the FnApi. The FnApi already
> does batching, I believe. What if we made schemas a fundamental part of our
> protos, and had no SchemaCoder?
>
> One advantage of SchemaCoders is that they allow nesting inside other
> coders.
>
> Schemas are not (I don't think) a replacement for coders at the
> implementation level, but in the user API they obviate the need for
> most users to interact with coders (as well as providing a richer,
> language-independent way of describing elements in most cases).
>
> > The FnApi could then batch up a bunch of rows and encode them using Arrow
> before sending over the wire to the harness.
>
> The encoding of records across the FnAPI can be more expressive than
> Coders, regardless of schemas.
>
> > Of course this still turns back into individual records before it goes
> back to user code. However well-known combiners can be executed directly in
> the harness, which means aggregations like "sum a field" can be run inside
> the harness over the columnar data. Moving these combiners into the harness
> might itself be a huge perf benefit for Python, as we could then run them
> in a more-performant language.
>
> You don't even have to move them out of Python to take advantage of
> this if you're using the right libraries and have the right
> representation. If you do move them it doesn't have to be into the
> harness, it could be an adjacent SDK. I envision a large suite of
> known URNs that can be placed wherever it's best.
>
>
>
> On Tue, Jan 8, 2019 at 7:12 PM Kenneth Knowles <k...@apache.org> wrote:
> >
> > And even more for SQL, or about the same since you are referring to
> DataFrames which have roughly relational semantics. Having the columnar
> model all the way to the data source would be big. Having near-zero-parsing
> cost for transmitted data would be big. These changes would make Beam a
> rather different project.
>
> I think Beam would be qualitatively the same project, but it would
> open up a lot of areas for optimization (both in the "computer
> resource" sense and "easy for people to get stuff done" sense).
>
>
> > Reuven
> >
> > On Tue, Jan 8, 2019 at 7:44 AM Robert Bradshaw <rober...@google.com>
> wrote:
> >>
> >> On Tue, Jan 8, 2019 at 4:32 PM Reuven Lax <re...@google.com> wrote:
> >> >
> >> > I agree with this, but I think it's a significant rethinking of Beam
> that I didn't want to couple to schemas. In addition to rethinking the API,
> it might also require rethinking all of our runners.
> >>
> >> We're already marshaling (including batching) data over the FnApi, so
> >> it might not be that big of a change. Also, the choice of encoding
> >> over the data channel is already parametrizable via a coder, so it's
> >> easy to make this an optional feature that runners and SDKs can opt
> >> into. I agree that we don't want to couple it to schemas (though
> >> that's where it becomes even more useful).
> >>
> >> > Also while columnar can be a large perf win, I suspect that we
> currently have lower-hanging fruit to optimize when it comes to performance.
> >>
> >> It's probably a bigger win for Python than for Java.
> >>
> >> >
> >> > Reuven
> >> >
> >> > On Tue, Jan 8, 2019 at 5:25 AM Robert Bradshaw <rober...@google.com>
> wrote:
> >> >>
> >> >> On Fri, Jan 4, 2019 at 12:54 AM Reuven Lax <re...@google.com> wrote:
> >> >> >
> >> >> > I looked at Apache Arrow as a potential serialization format for
> Row coders. At the time it didn't seem a perfect fit - Beam's programming
> model is record-at-a-time, and Arrow is optimized for large batches of
> records (while Beam has a concept of "bundles", they are completely
> non-deterministic, and records might bundle differently on retry). You could use
> Arrow with single-record batches, but I suspect that would end up adding a
> lot of extra overhead. That being said, I think it's still something worth
> investigating further.
> >> >>
> >> >> Though Beam's model is row-oriented, I think it'd make a lot of sense
> >> >> to support column-oriented transfer of data across the data plane
> >> >> (we're already concatenating serialized records lengthwise), with
> >> >> Arrow as a first candidate, and (either as part of the public API or
> >> >> as an implementation detail) columnar processing as well (e.g.
> >> >> projections, maps, filters, and aggregations can often be done more
> >> >> efficiently in a columnar fashion). While this is often a significant
> >> >> win in C++ (and presumably Java), it's essential for doing
> >> >> high-performance computing in Python (e.g. Numpy, SciPy, Pandas,
> >> >> Tensorflow, ... all have batch-oriented APIs and avoid representing
> >> >> records as individual objects, something we'll need to tackle for
> >> >> BeamPython at least).
> >> >>
> >> >> >
> >> >> > Reuven
> >> >> >
> >> >> >
> >> >> >
> >> >> > On Fri, Jan 4, 2019 at 12:34 AM Gleb Kanterov <g...@spotify.com>
> wrote:
> >> >> >>
> >> >> >> Reuven, it sounds great. I see there is a similar thing to Row
> coders happening in Apache Arrow, and there is a similarity between Apache
> Arrow Flight and data exchange service in portability. How do you see these
> two things relate to each other in the long term?
> >> >> >>
> >> >> >> On Fri, Jan 4, 2019 at 12:13 AM Reuven Lax <re...@google.com>
> wrote:
> >> >> >>>
> >> >> >>> The biggest advantage is actually readability and usability. A
> secondary advantage is that it means that Go will be able to interact
> seamlessly with BeamSQL, which would be a big win for Go.
> >> >> >>>
> >> >> >>> A schema is basically a way of saying that a record has a
> specific set of (possibly nested, possibly repeated) fields. So for
> instance let's say that the user's type is a struct with fields named user,
> country, purchaseCost. This allows us to provide transforms that operate on
> field names. Some examples (using the Java API):
> >> >> >>>
> >> >> >>> // Select out only the user field.
> >> >> >>> PCollection users = events.apply(Select.fields("user"));
> >> >> >>>
> >> >> >>> // Join two PCollections by user.
> >> >> >>> PCollection joinedEvents =
> >> >> >>>     queries.apply(Join.innerJoin(clicks).byFields("user"));
> >> >> >>>
> >> >> >>> // For each country, calculate the total purchase cost as well
> >> >> >>> // as the top 10 purchases. A new schema is created containing
> >> >> >>> // fields total_cost and top_purchases, and rows are created
> >> >> >>> // with the aggregation results.
> >> >> >>> PCollection purchaseStatistics = events.apply(
> >> >> >>>     Group.byFieldNames("country")
> >> >> >>>         .aggregateField("purchaseCost", Sum.ofLongs(), "total_cost")
> >> >> >>>         .aggregateField("purchaseCost", Top.largestLongs(10),
> >> >> >>>             "top_purchases"));
> >> >> >>>
> >> >> >>>
> >> >> >>> This is far more readable than what we have today, and what
> unlocks this is that Beam actually knows the structure of the record
> instead of assuming records are uncrackable blobs.
> >> >> >>>
> >> >> >>> Note that a coder is basically a special case of a schema that
> has a single field.
> >> >> >>>
> >> >> >>> In BeamJava we have a SchemaRegistry which knows how to turn
> user types into schemas. We use reflection to analyze many user types (e.g.
> simple POJO structs, JavaBean classes, Avro records, protocol buffers,
> etc.) to determine the schema; however, this is done only when the graph is
> initially generated. We do use code generation (in Java we do bytecode
> generation) to make this somewhat more efficient. I'm willing to bet that
> the code generator you've written for structs could be very easily modified
> for schemas instead, so it would not be wasted work if we went with schemas.
> >> >> >>>
> >> >> >>> One of the things I'm working on now is documenting Beam
> schemas. They are already very powerful and useful, but since there is
> still nothing in our documentation about them, they are not yet widely
> used. I expect to finish draft documentation by the end of January.
> >> >> >>>
> >> >> >>> Reuven
> >> >> >>>
> >> >> >>> On Thu, Jan 3, 2019 at 11:32 PM Robert Burke <r...@google.com>
> wrote:
> >> >> >>>>
> >> >> >>>> That's an interesting idea. I must confess I don't rightly know
> the difference between a schema and a coder, but here's what I've got with a
> bit of searching through memory and the mailing list. Please let me know if
> I'm off track.
> >> >> >>>>
> >> >> >>>> As near as I can tell, a schema, as far as Beam takes it, is a
> mechanism to define what data is extracted from a given row of data. So in
> principle, there's an opportunity to be more efficient with data with many
> columns that aren't being used, and only extract the data that's meaningful
> to the pipeline.
> >> >> >>>> The trick then is how to apply the schema to a given
> serialization format, which is something I'm missing in my mental model
> (and then how to do it efficiently in Go).
> >> >> >>>>
> >> >> >>>> I do know that the Go client package for BigQuery does
> something like that, using field tags. Similarly, the "encoding/json"
> package in the Go Standard Library permits annotating fields, and it will
> read out and deserialize only those JSON fields.
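> >> >> >>>>
> >> >> >>>> For example, here's a minimal runnable sketch of that standard
> >> >> >>>> library behavior (the Event type and the sample JSON are made up):
> >> >> >>>>
> >> >> >>>> package main
> >> >> >>>>
> >> >> >>>> import (
> >> >> >>>> 	"encoding/json"
> >> >> >>>> 	"fmt"
> >> >> >>>> )
> >> >> >>>>
> >> >> >>>> // Only tagged, exported fields are read out; unknown JSON
> >> >> >>>> // fields are silently ignored.
> >> >> >>>> type Event struct {
> >> >> >>>> 	User string `json:"user"`
> >> >> >>>> 	Cost int64  `json:"purchaseCost"`
> >> >> >>>> }
> >> >> >>>>
> >> >> >>>> func main() {
> >> >> >>>> 	data := []byte(`{"user": "ada", "purchaseCost": 42, "extra": true}`)
> >> >> >>>> 	var e Event
> >> >> >>>> 	if err := json.Unmarshal(data, &e); err != nil {
> >> >> >>>> 		panic(err)
> >> >> >>>> 	}
> >> >> >>>> 	fmt.Printf("%+v\n", e) // prints {User:ada Cost:42}
> >> >> >>>> }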
> >> >> >>>>
> >> >> >>>> A concern I have is that Go (at present) would require
> pre-compile time code generation for schemas to be efficient, and they
> would still mostly boil down to turning []bytes into real structs. Go
> reflection doesn't keep up.
> >> >> >>>> Go has no mechanism I'm aware of to Just In Time compile more
> efficient processing of values.
> >> >> >>>> It's also not 100% clear how Schemas would play with protocol
> buffers or similar.
> >> >> >>>> BigQuery has a mechanism of generating a JSON schema from a
> proto file, but that's only the specification half, not the using half.
> >> >> >>>>
> >> >> >>>> As it stands, the code generator I've been building these last
> months could (in principle) statically analyze a user's struct, and then
> generate an efficient dedicated coder for it. It just has nowhere to put
> them such that the Go SDK would use them.
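> >> >> >>>>
> >> >> >>>> To illustrate, here's a hypothetical sketch of one such
> >> >> >>>> generated encoder, reusing the made-up Event type from above.
> >> >> >>>> The wire format (length-prefixed string, then the int64) is
> >> >> >>>> invented for the example:
> >> >> >>>>
> >> >> >>>> import (
> >> >> >>>> 	"encoding/binary"
> >> >> >>>> 	"io"
> >> >> >>>> )
> >> >> >>>>
> >> >> >>>> // Field order is fixed at generation time, so encoding needs
> >> >> >>>> // no reflection at execution time.
> >> >> >>>> func encodeEvent(e Event, w io.Writer) error {
> >> >> >>>> 	if err := binary.Write(w, binary.BigEndian, int32(len(e.User))); err != nil {
> >> >> >>>> 		return err
> >> >> >>>> 	}
> >> >> >>>> 	if _, err := io.WriteString(w, e.User); err != nil {
> >> >> >>>> 		return err
> >> >> >>>> 	}
> >> >> >>>> 	return binary.Write(w, binary.BigEndian, e.Cost)
> >> >> >>>> }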
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> On Thu, Jan 3, 2019 at 1:39 PM Reuven Lax <re...@google.com>
> wrote:
> >> >> >>>>>
> >> >> >>>>> I'll make a different suggestion. There's been some chatter
> that schemas are a better tool than coders, and that in Beam 3.0 we should
> make schemas the basic semantics instead of coders. Schemas provide
> everything a coder provides, but also allow for far more readable code. We
> can't make such a change in Beam Java 2.X for compatibility reasons, but
> maybe in Go we're better off starting with schemas instead of coders?
> >> >> >>>>>
> >> >> >>>>> Reuven
> >> >> >>>>>
> >> >> >>>>> On Thu, Jan 3, 2019 at 8:45 PM Robert Burke <
> rob...@frantil.com> wrote:
> >> >> >>>>>>
> >> >> >>>>>> One area the Go SDK currently lacks is the ability for
> users to specify their own coders for their types.
> >> >> >>>>>>
> >> >> >>>>>> I've written a proposal document, and while I'm confident
> about the core, there are certainly some edge cases that require discussion
> before getting on with the implementation.
> >> >> >>>>>>
> >> >> >>>>>> At present, the SDK only permits primitive value types (all
> numeric types but complex, strings, and []bytes), which are coded with beam
> coders, and structs whose exported fields are of those types, which are then
> encoded as JSON. Protocol buffer support is hacked in to avoid the type
> analyzer, and represents the current workaround for this issue.
> >> >> >>>>>>
> >> >> >>>>>> The high-level proposal is to catch up with Python and Java,
> and have a coder registry; a rough sketch of what registration might look
> like follows. In addition, arrays and maps should be permitted as well.
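> >> >> >>>>>>
> >> >> >>>>>> A hypothetical sketch of that registration; the Event type is
> >> >> >>>>>> made up, beam.RegisterCoder and its signature are not settled,
> >> >> >>>>>> and the JSON bodies are stand-ins for a real coder:
> >> >> >>>>>>
> >> >> >>>>>> package mypipeline
> >> >> >>>>>>
> >> >> >>>>>> import (
> >> >> >>>>>> 	"encoding/json"
> >> >> >>>>>> 	"reflect"
> >> >> >>>>>>
> >> >> >>>>>> 	"github.com/apache/beam/sdks/go/pkg/beam"
> >> >> >>>>>> )
> >> >> >>>>>>
> >> >> >>>>>> // Event is a made-up user struct.
> >> >> >>>>>> type Event struct {
> >> >> >>>>>> 	User string
> >> >> >>>>>> 	Cost int64
> >> >> >>>>>> }
> >> >> >>>>>>
> >> >> >>>>>> func encodeEvent(e Event) ([]byte, error) { return json.Marshal(e) }
> >> >> >>>>>>
> >> >> >>>>>> func decodeEvent(b []byte) (Event, error) {
> >> >> >>>>>> 	var e Event
> >> >> >>>>>> 	err := json.Unmarshal(b, &e)
> >> >> >>>>>> 	return e, err
> >> >> >>>>>> }
> >> >> >>>>>>
> >> >> >>>>>> func init() {
> >> >> >>>>>> 	// Hypothetical registry entry point: the SDK would look
> >> >> >>>>>> 	// these up by reflect.Type instead of falling back to JSON.
> >> >> >>>>>> 	beam.RegisterCoder(reflect.TypeOf(Event{}), encodeEvent, decodeEvent)
> >> >> >>>>>> }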
> >> >> >>>>>>
> >> >> >>>>>> If you have alternatives, or other suggestions and opinions,
> I'd love to hear them! Otherwise my intent is to get a PR ready by the end
> of January.
> >> >> >>>>>>
> >> >> >>>>>> Thanks!
> >> >> >>>>>> Robert Burke
> >> >> >>>>
> >> >> >>>>
> >> >> >>>>
> >> >> >>>> --
> >> >> >>>> http://go/where-is-rebo
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> --
> >> >> >> Cheers,
> >> >> >> Gleb
>
