Re: [DISCUSS] Portability representation of schemas

2019-09-03 Thread Gleb Kanterov
Thanks, Brian. It makes sense. It wasn't entirely clear from the commit messages, which is why I wanted to double check. On Tue, Sep 3, 2019 at 5:43 PM Brian Hulette wrote: > Hey Gleb, thanks for bringing this up. The PR that was reverted (8853) is > the same one that I referred to earlier in this

Re: [DISCUSS] Portability representation of schemas

2019-09-03 Thread Brian Hulette
Hey Gleb, thanks for bringing this up. The PR that was reverted (8853) is the same one that I referred to earlier in this thread. It modified the existing portable schema representation to match what we settled on here - and yes it removed support for logical types like fixed bytes. I (foolishly)

Re: [DISCUSS] Portability representation of schemas

2019-09-03 Thread Gleb Kanterov
Recently there was a pull request (that was reverted) for adding a portable representation of schemas. It's great to see things moving forward, but I'm worried that it doesn't support any logical types, especially fixed bytes. That makes runners using portable schemas unusable, for instance, when

Re: [DISCUSS] Portability representation of schemas

2019-06-19 Thread Kenneth Knowles
Seems like a practical approach to get moving on things. Just to restate my understanding: - in Java it is PCollection but with the row coder holding to/from/clazz (I'm calling it row coder because row is the binary format, while schemas may have multiple possible formats) - in portability, the

Re: [DISCUSS] Portability representation of schemas

2019-06-19 Thread Reuven Lax
Robert, you are correct that in principle the to/from functions are needed on the operation, as that's where automatic conversion happens (in Java it happens in DoFnRunner). However there are two blockers there: 1. As Brian mentioned, the issue in Java is that we never have PCollection in this

Re: [DISCUSS] Portability representation of schemas

2019-06-18 Thread Robert Bradshaw
Thanks for updating that alternative. As for the to/from functions, it does seem pragmatic to dangle them off the purely portable representation (either as a field there, or as an opaque logical type whose payload contains the to/from functions, or a separate coder that wraps the schema coder

Re: [DISCUSS] Portability representation of schemas

2019-06-17 Thread Brian Hulette
Realized I completely ignored one of your points, added another response inline. On Fri, Jun 14, 2019 at 2:20 AM Robert Bradshaw wrote: > On Thu, Jun 13, 2019 at 8:42 PM Reuven Lax wrote: > > > > Spoke to Brian about his proposal. It is essentially this: > > > > We create PortableSchemaCoder,

Re: [DISCUSS] Portability representation of schemas

2019-06-14 Thread Brian Hulette
On Fri, Jun 14, 2019 at 2:20 AM Robert Bradshaw wrote: > On Thu, Jun 13, 2019 at 8:42 PM Reuven Lax wrote: > > > > Spoke to Brian about his proposal. It is essentially this: > > > > We create PortableSchemaCoder, with a well-known URN. This coder is > parameterized by the schema (i.e. list of

Re: [DISCUSS] Portability representation of schemas

2019-06-14 Thread Robert Bradshaw
On Thu, Jun 13, 2019 at 8:42 PM Reuven Lax wrote: > > Spoke to Brian about his proposal. It is essentially this: > > We create PortableSchemaCoder, with a well-known URN. This coder is > parameterized by the schema (i.e. list of field name -> field type pairs). Given that we have a field type

Re: [DISCUSS] Portability representation of schemas

2019-06-13 Thread Brian Hulette
Yes that's pretty much what I had in mind. The one point I'm unsure about is that I was thinking the *calling* SDK would need to insert the transform to convert to/from Rows (unless it's an SDK that uses the portable SchemaCoder everywhere and doesn't need a conversion). For example, python might

Re: [DISCUSS] Portability representation of schemas

2019-06-13 Thread Reuven Lax
Spoke to Brian about his proposal. It is essentially this: We create PortableSchemaCoder, with a well-known URN. This coder is parameterized by the schema (i.e. list of field name -> field type pairs). Java also continues to have its own CustomSchemaCoder. This is parameterized by the schema as

Re: [DISCUSS] Portability representation of schemas

2019-06-13 Thread Robert Bradshaw
On Thu, Jun 13, 2019 at 5:47 AM Reuven Lax wrote: > > On Wed, Jun 12, 2019 at 8:29 PM Kenneth Knowles wrote: > >> Can we choose a first step? I feel there's consensus around: >> >> - the basic idea of what a schema looks like, ignoring logical types or >> SDK-specific bits >> - the version of

Re: [DISCUSS] Portability representation of schemas

2019-06-12 Thread Kenneth Knowles
Can we choose a first step? I feel there's consensus around: - the basic idea of what a schema looks like, ignoring logical types or SDK-specific bits - the version of logical type which is a standardized URN+payload plus a representation Perhaps we could commit this and see what it looks like

Re: [DISCUSS] Portability representation of schemas

2019-06-12 Thread Reuven Lax
On Wed, Jun 12, 2019 at 3:46 PM Robert Bradshaw wrote: > On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles wrote: > >> >> I believe the schema registry is a transient construction-time concept. I >> don't think there's any need for a concept of a registry in the portable >> representation. >> >>

Re: [DISCUSS] Portability representation of schemas

2019-06-12 Thread Robert Bradshaw
If we go with Reuven's (2) then a logical type like urn:beam:logical:javasdk is not constraining at all--any SDK/runner that does not understand this can simply act on its representation (and if it does not understand that, look at its representation, all the way back to primitives). However, I
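
The fallback described above (act on a logical type's representation when its URN is not understood, repeating down to primitives) can be sketched roughly as follows. This is a minimal illustration only: the `Primitive` and `LogicalType` classes, the `resolve` function, and the `urn:example:millis` URN are hypothetical names invented for this sketch and are not part of any Beam proto or SDK.

```python
from dataclasses import dataclass
from typing import Set, Union


@dataclass(frozen=True)
class Primitive:
    """Hypothetical stand-in for a primitive schema type such as INT64."""
    name: str


@dataclass(frozen=True)
class LogicalType:
    """Hypothetical logical type: a URN plus the type it is represented as."""
    urn: str
    representation: Union["LogicalType", Primitive]


def resolve(field_type: Union[LogicalType, Primitive],
            known_urns: Set[str]) -> Union[LogicalType, Primitive]:
    """Unwrap logical types until reaching one whose URN is understood,
    or a primitive if none of the URNs along the way are known."""
    current = field_type
    while isinstance(current, LogicalType) and current.urn not in known_urns:
        current = current.representation
    return current


# A Java-specific logical type layered on an assumed millis type over INT64.
java_type = LogicalType(
    "urn:beam:logical:javasdk",
    LogicalType("urn:example:millis", Primitive("INT64")),
)
```

An SDK that knows neither URN ends up operating on the primitive, while one that knows only the inner URN stops there.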

Re: [DISCUSS] Portability representation of schemas

2019-06-12 Thread Robert Bradshaw
On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles wrote: > > I believe the schema registry is a transient construction-time concept. I > don't think there's any need for a concept of a registry in the portable > representation. > > I'd rather urn:beam:schema:logicaltype:javasdk not be used

Re: [DISCUSS] Portability representation of schemas

2019-06-12 Thread Brian Hulette
On Wed, Jun 12, 2019 at 2:01 PM Reuven Lax wrote: > Two thoughts here: > > 1. I don't think we should worry about the to/from functions much here. > From the "portable" perspective, I think the schema should be all that's > necessary. A given SDK - say the Java SDK - might want to present a

Re: [DISCUSS] Portability representation of schemas

2019-06-12 Thread Reuven Lax
Two thoughts here: 1. I don't think we should worry about the to/from functions much here. From the "portable" perspective, I think the schema should be all that's necessary. A given SDK - say the Java SDK - might want to present a nicer programming interface by allowing users to use the types

Re: [DISCUSS] Portability representation of schemas

2019-06-11 Thread Kenneth Knowles
Snipping because the context is getting out of hand. On Mon, Jun 10, 2019 at 3:42 PM Robert Bradshaw wrote: > On Mon, Jun 10, 2019 at 11:53 PM Kenneth Knowles wrote: > >> Most things you would do directly to a representation without knowing >> what it represents are going to be nonsense. But

Re: [DISCUSS] Portability representation of schemas

2019-06-10 Thread Robert Bradshaw
On Mon, Jun 10, 2019 at 11:53 PM Kenneth Knowles wrote: > Good points. At a high level it doesn't sound like anything is blocking, > right? > It doesn't sound like we've settled on an actual proto definition yet, which may be influenced by the questions below. > On Mon, Jun 10, 2019 at 2:14

Re: [DISCUSS] Portability representation of schemas

2019-06-10 Thread Kenneth Knowles
Good points. At a high level it doesn't sound like anything is blocking, right? On Mon, Jun 10, 2019 at 2:14 AM Robert Bradshaw wrote: > On Sat, Jun 8, 2019 at 9:25 PM Kenneth Knowles wrote: > >> On Fri, Jun 7, 2019 at 4:35 AM Robert Burke wrote: >> >>> Wouldn't SDK specific types always be

Re: [DISCUSS] Portability representation of schemas

2019-06-10 Thread Robert Bradshaw
On Sat, Jun 8, 2019 at 9:25 PM Kenneth Knowles wrote: > On Fri, Jun 7, 2019 at 4:35 AM Robert Burke wrote: > >> Wouldn't SDK specific types always be under the "coders" component >> instead of the logical type listing? >> >> Offhand, having a separate normalized listing of logical schema types

Re: [DISCUSS] Portability representation of schemas

2019-06-07 Thread Anton Kedin
The topic of schema registries probably does not block the design and implementation of logical types and portable schemas by themselves, however I think we should spend some time discussing it (probably in a separate thread) so that all SDKs have similar mechanisms for schema registration and

Re: [DISCUSS] Portability representation of schemas

2019-06-07 Thread Robert Burke
Wouldn't SDK specific types always be under the "coders" component instead of the logical type listing? Offhand, having a separate normalized listing of logical schema types in the pipeline components message of the types seems about right. Then they're unambiguous, but can also either refer to

Re: [DISCUSS] Portability representation of schemas

2019-06-05 Thread Brian Hulette
If we want to have a Pipeline level registry, we could add it to Components [1]. message Components { ... map logical_types; } And in FieldType reference the logical types by id: oneof field_type { AtomicType atomic_type; ArrayType array_type; ... string logical_type_id;// was
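
The idea of a pipeline-level registry, where a field type refers to a logical type by id and the id is looked up in the pipeline's components, can be sketched as below. This is an illustration under assumptions, not the proposed proto: `LogicalTypeEntry`, the Python `Components` class, `lookup_logical_type`, the string-valued `representation`, and the id `"lt0"` are all hypothetical and chosen only for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class LogicalTypeEntry:
    """Hypothetical registry value: a URN and a simplified representation."""
    urn: str
    representation: str  # simplified to a type name for this sketch


@dataclass
class Components:
    """Hypothetical stand-in for pipeline components holding the registry."""
    logical_types: Dict[str, LogicalTypeEntry] = field(default_factory=dict)


def lookup_logical_type(components: Components,
                        logical_type_id: str) -> LogicalTypeEntry:
    """Resolve an id stored on a field type against the pipeline registry."""
    try:
        return components.logical_types[logical_type_id]
    except KeyError:
        raise KeyError(
            f"unknown logical type id: {logical_type_id!r}") from None


components = Components()
components.logical_types["lt0"] = LogicalTypeEntry(
    urn="urn:example:fixed_bytes", representation="BYTES")
```

Keeping the registry on the pipeline rather than on each PCollection means a logical type is described once and referenced by id wherever it is used.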

Re: [DISCUSS] Portability representation of schemas

2019-06-04 Thread Brian Hulette
Yeah that's what I meant. It does seem reasonable to scope any registry by pipeline and not by PCollection. Then it seems we would want the entire LogicalType (including the `FieldType representation` field) as the value type, and not just LogicalTypeConversion. Otherwise we're separating

Re: [DISCUSS] Portability representation of schemas

2019-06-04 Thread Reuven Lax
On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette wrote: > > > On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax wrote: > >> >> >> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette >> wrote: >> >>> > It has to go into the proto somewhere (since that's the only way the >>> SDK can get it), but I'm not sure

Re: [DISCUSS] Portability representation of schemas

2019-06-04 Thread Brian Hulette
On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax wrote: > > > On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette wrote: > >> > It has to go into the proto somewhere (since that's the only way the >> SDK can get it), but I'm not sure they should be considered integral parts >> of the type. >> Are you just

Re: [DISCUSS] Portability representation of schemas

2019-06-03 Thread Reuven Lax
On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette wrote: > > It has to go into the proto somewhere (since that's the only way the > SDK can get it), but I'm not sure they should be considered integral parts > of the type. > Are you just advocating for an approach where any SDK-specific information >

Re: [DISCUSS] Portability representation of schemas

2019-06-03 Thread Brian Hulette
> It has to go into the proto somewhere (since that's the only way the SDK can get it), but I'm not sure they should be considered integral parts of the type. Are you just advocating for an approach where any SDK-specific information is stored outside of the Schema message itself so that Schema

Re: [DISCUSS] Portability representation of schemas

2019-06-03 Thread Kenneth Knowles
On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax wrote: > So I feel a bit leery about making the to/from functions a fundamental > part of the portability representation. In my mind, that is very tied to a > specific SDK/language. A SDK (say the Java SDK) wants to allow users to use > a wide variety

Re: [DISCUSS] Portability representation of schemas

2019-06-03 Thread Reuven Lax
So I feel a bit leery about making the to/from functions a fundamental part of the portability representation. In my mind, that is very tied to a specific SDK/language. A SDK (say the Java SDK) wants to allow users to use a wide variety of native types with schemas, and under the covers uses the

Re: [DISCUSS] Portability representation of schemas

2019-06-03 Thread Brian Hulette
Ah I see, I didn't realize that. Then I suppose we'll need to/from functions somewhere in the logical type conversion to preserve the current behavior. I'm still a little hesitant to make these functions an explicit part of LogicalTypeConversion for another reason. Down the road, schemas could

Re: [DISCUSS] Portability representation of schemas

2019-06-01 Thread Reuven Lax
Keep in mind that right now the SchemaRegistry is only assumed to exist at graph-construction time, not at execution time; all information in the schema registry is embedded in the SchemaCoder, which is the only thing we keep around when the pipeline is actually running. We could look into

Re: [DISCUSS] Portability representation of schemas

2019-05-31 Thread Brian Hulette
> Can you propose what the protos would look like in this case? Right now LogicalType does not contain the to/from conversion functions in the proto. Do you think we'll need to add these in? Maybe. Right now the proposed LogicalType message is pretty simple/generic: message LogicalType {

Re: [DISCUSS] Portability representation of schemas

2019-05-28 Thread Lukasz Cwik
I like the concept of expressing type coercion as a wrapper coder which says that this language treats this type as Foo. This seems to be useful in general for cross language pipelines since it is much more likely that two languages will understand an encoding but may want to express the type

Re: [DISCUSS] Portability representation of schemas

2019-05-28 Thread Brian Hulette
On Sun, May 26, 2019 at 1:25 PM Reuven Lax wrote: > > > On Fri, May 24, 2019 at 11:42 AM Brian Hulette > wrote: > >> *tl;dr:* SchemaCoder represents a logical type with a base type of Row >> and we should think about that. >> >> I'm a little concerned that the current proposals for a portable

Re: [DISCUSS] Portability representation of schemas

2019-05-26 Thread Reuven Lax
On Fri, May 24, 2019 at 11:42 AM Brian Hulette wrote: > *tl;dr:* SchemaCoder represents a logical type with a base type of Row > and we should think about that. > > I'm a little concerned that the current proposals for a portable > representation don't actually fully represent Schemas. It seems

Re: [DISCUSS] Portability representation of schemas

2019-05-24 Thread Lukasz Cwik
Your reasoning about SchemaCoder really being a type coercion coder makes a lot of sense to me. On Fri, May 24, 2019 at 11:42 AM Brian Hulette wrote: > *tl;dr:* SchemaCoder represents a logical type with a base type of Row > and we should think about that. > > I'm a little concerned that the

Re: [DISCUSS] Portability representation of schemas

2019-05-24 Thread Brian Hulette
*tl;dr:* SchemaCoder represents a logical type with a base type of Row and we should think about that. I'm a little concerned that the current proposals for a portable representation don't actually fully represent Schemas. It seems to me that the current java-only Schemas are made up of three

Re: [DISCUSS] Portability representation of schemas

2019-05-10 Thread Brian Hulette
Ah thanks! I added some language there. *From: *Kenneth Knowles *Date: *Thu, May 9, 2019 at 5:31 PM *To: *dev > *From: *Brian Hulette > *Date: *Thu, May 9, 2019 at 2:02 PM > *To: * > > We briefly discussed using arrow schemas in place of beam schemas entirely >> in an arrow thread [1]. The

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Kenneth Knowles
*From: *Brian Hulette *Date: *Thu, May 9, 2019 at 2:02 PM *To: * We briefly discussed using arrow schemas in place of beam schemas entirely > in an arrow thread [1]. The biggest reason not to do this was that we wanted > to have a type for large iterables in beam schemas. But given that large >

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Brian Hulette
We briefly discussed using arrow schemas in place of beam schemas entirely in an arrow thread [1]. The biggest reason not to do this was that we wanted to have a type for large iterables in beam schemas. But given that large iterables aren't currently implemented, beam schemas look very similar to

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Robert Bradshaw
From: Reuven Lax Date: Thu, May 9, 2019 at 7:29 PM To: dev > Also in the future we might be able to do optimizations at the runner level > if at the portability layer we understood schemas instead of just raw coders. > This could be things like only parsing a subset of a row (if we know only a

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Reuven Lax
Also in the future we might be able to do optimizations at the runner level if at the portability layer we understood schemas instead of just raw coders. This could be things like only parsing a subset of a row (if we know only a few fields are accessed) or using a columnar data structure like
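
The optimization of parsing only a subset of a row when the schema is known can be illustrated with a toy encoding. This sketch assumes a made-up fixed-width, little-endian row layout built with Python's `struct` module; it is not Beam's row encoding, and `Schema`, `encode_row`, and `decode_fields` are hypothetical names used only here.

```python
import struct
from typing import Any, Dict, List, Set, Tuple

# Hypothetical schema: ordered (field name, struct format code) pairs.
Schema = List[Tuple[str, str]]


def encode_row(schema: Schema, values: Dict[str, Any]) -> bytes:
    """Pack each field in schema order with a fixed-width encoding."""
    return b"".join(
        struct.pack("<" + fmt, values[name]) for name, fmt in schema)


def decode_fields(schema: Schema, data: bytes,
                  wanted: Set[str]) -> Dict[str, Any]:
    """Decode only the requested fields, skipping over the others."""
    out: Dict[str, Any] = {}
    offset = 0
    for name, fmt in schema:
        size = struct.calcsize("<" + fmt)
        if name in wanted:
            # Knowing the schema lets us jump straight to this field.
            (out[name],) = struct.unpack_from("<" + fmt, data, offset)
        offset += size
    return out


schema: Schema = [("id", "q"), ("score", "d"), ("flag", "?")]
row = encode_row(schema, {"id": 7, "score": 1.5, "flag": True})
```

With an opaque coder the whole element would have to be decoded. Here the schema supplies each field's width, so unneeded fields are skipped without being parsed.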

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Robert Bradshaw
From: Kenneth Knowles Date: Thu, May 9, 2019 at 5:44 PM To: dev >> > *Why multiple int types?* The domain of values for these types are >> > different. For a language with one "int" or "number" type, that's another >> > domain of values. >> >> What is the value in having different domains? If

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Robert Bradshaw
On the flip side, Schemas are equivalent to the space of Coders with the addition of a RowCoder and the ability to materialize to something other than bytes, right? (Perhaps I'm missing something big here...) This may make a backwards-compatible transition easier. (SDK-side, the ability to reason

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Kenneth Knowles
*From: *Robert Bradshaw *Date: *Thu, May 9, 2019 at 7:48 AM *To: *dev From: Kenneth Knowles > Date: Thu, May 9, 2019 at 10:05 AM > To: dev > > > This is a huge development. Top posting because I can be more compact. > > > > I really think after the initial idea converges this needs a design doc

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Alex Van Boxel
OK, fair. This parallels how timestamps are implemented in protobuf. Then it's important (and I'll join the design doc) that we have a list of standard logical types. Alex Van Boxel On Thu, May 9, 2019 at 4:11 PM Reuven Lax wrote: > > > On Thu, May 9, 2019 at 6:34 AM Alex Van Boxel

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Reuven Lax
FYI I can imagine a world in which we have no coders. We could define the entire model on top of schemas. Today's "Coder" is completely equivalent to a single-field schema with a logical-type field (actually the latter is slightly more expressive as you aren't forced to serialize into bytes). Due

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Robert Bradshaw
From: Kenneth Knowles Date: Thu, May 9, 2019 at 10:05 AM To: dev > This is a huge development. Top posting because I can be more compact. > > I really think after the initial idea converges this needs a design doc with > goals and alternatives. It is an extraordinarily consequential model

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Reuven Lax
On Thu, May 9, 2019 at 6:34 AM Alex Van Boxel wrote: > My biggest concern is that if we don't make TIMESTAMP (yes, TIMESTAMP is a > better name for DATETIME) a first class citizen, we will get > *inconsistencies* between the different portability implementations. The > same holds true for

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Alex Van Boxel
My biggest concern is that if we don't make TIMESTAMP (yes, TIMESTAMP is a better name for DATETIME) a first class citizen, we will get *inconsistencies* between the different portability implementations. The same holds true for DECIMAL and DURATION. If we aren't giving pipeline developers a

Re: [DISCUSS] Portability representation of schemas

2019-05-09 Thread Kenneth Knowles
This is a huge development. Top posting because I can be more compact. I really think after the initial idea converges this needs a design doc with goals and alternatives. It is an extraordinarily consequential model change. So in the spirit of doing the work / bias towards action, I created a

Re: [DISCUSS] Portability representation of schemas

2019-05-08 Thread Udi Meiri
From a Python type hints perspective, how do schemas fit? Type hints are currently used to determine which coder to use. It seems that given a schema field, it would be useful to be able to convert it to a coder (using URNs?), and to convert the coder into a typing type. This would allow for

Re: [DISCUSS] Portability representation of schemas

2019-05-08 Thread Lukasz Cwik
Are you suggesting that schemas become an explicit field on PCollection or that the coder on PCollections has a well known schema coder type that has a payload that has field names, ids, type, ...? I'm much more for the latter since it allows for versioning schema representations over time without

Re: [DISCUSS] Portability representation of schemas

2019-05-08 Thread Reuven Lax
On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw wrote: > Very excited to see this. In particular, I think this will be very > useful for cross-language pipelines (not just SQL, but also for > describing non-trivial data, e.g. for source and sink reuse). > > The proto specification makes sense to

Re: [DISCUSS] Portability representation of schemas

2019-05-08 Thread Robert Bradshaw
Very excited to see this. In particular, I think this will be very useful for cross-language pipelines (not just SQL, but also for describing non-trivial data, e.g. for source and sink reuse). The proto specification makes sense to me. The only thing that looks like it's missing (other than

Re: [DISCUSS] Portability representation of schemas

2019-05-08 Thread Reuven Lax
On Wed, May 8, 2019 at 10:57 AM Rui Wang wrote: > Regarding DATETIME, I totally agree it should be removed as a > primitive type so that each language does not have to find its own time library > (and if they could not find any, they will likely go to a logical type and > use int64 from Schema). > >

Re: [DISCUSS] Portability representation of schemas

2019-05-08 Thread Rui Wang
Regarding DATETIME, I totally agree it should be removed as a primitive type so that each language does not have to find its own time library (and if they could not find any, they will likely go to a logical type and use int64 from Schema). I have two questions regarding the representation: 1.

[DISCUSS] Portability representation of schemas

2019-05-08 Thread Reuven Lax
Beam Java's support for schemas is just about done: we infer schemas from a variety of types, we have a variety of utility transforms (join, aggregate, etc.) for schemas, and schemas are integrated with the ParDo machinery. The big remaining task I'm working on is writing documentation and