Thanks, Brian. It makes sense. It wasn't entirely clear from the commit
messages, which is why I wanted to double-check.
On Tue, Sep 3, 2019 at 5:43 PM Brian Hulette wrote:
> Hey Gleb, thanks for bringing this up. The PR that was reverted (8853) is
> the same one that I referred to earlier in this
Hey Gleb, thanks for bringing this up. The PR that was reverted (8853) is
the same one that I referred to earlier in this thread. It modified the
existing portable schema representation to match what we settled on here -
and yes it removed support for logical types like fixed bytes. I
(foolishly)
Recently there was a pull request (that was reverted) for adding portable
representation of schemas. It's great to see things moving forward, but I'm
worried that it doesn't support any logical types, especially fixed bytes.
That makes runners using portable schemas unusable, for instance, when
Seems like a practical approach to get moving on things. Just to restate my
understanding:
- in Java it is PCollection but with the row coder holding
to/from/clazz (I'm calling it row coder because row is the binary format,
while schemas may have multiple possible formats)
- in portability, the
Robert, you are correct that in principle the to/from functions are needed
on the operation, as that's where automatic conversion happens (in Java it
happens in DoFnRunner). However there are two blockers there:
1. As Brian mentioned, the issue in Java is that we never have
PCollection in this
Thanks for updating that alternative.
As for the to/from functions, it does seem pragmatic to dangle them
off the purely portable representation (either as a field there, or as
an opaque logical type whose payload contains the to/from functions,
or a separate coder that wraps the schema coder
Realized I completely ignored one of your points, added another response
inline.
On Fri, Jun 14, 2019 at 2:20 AM Robert Bradshaw wrote:
> On Thu, Jun 13, 2019 at 8:42 PM Reuven Lax wrote:
> >
> > Spoke to Brian about his proposal. It is essentially this:
> >
> > We create PortableSchemaCoder,
On Fri, Jun 14, 2019 at 2:20 AM Robert Bradshaw wrote:
> On Thu, Jun 13, 2019 at 8:42 PM Reuven Lax wrote:
> >
> > Spoke to Brian about his proposal. It is essentially this:
> >
> > We create PortableSchemaCoder, with a well-known URN. This coder is
> parameterized by the schema (i.e. list of
On Thu, Jun 13, 2019 at 8:42 PM Reuven Lax wrote:
>
> Spoke to Brian about his proposal. It is essentially this:
>
> We create PortableSchemaCoder, with a well-known URN. This coder is
> parameterized by the schema (i.e. list of field name -> field type pairs).
Given that we have a field type
Yes that's pretty much what I had in mind. The one point I'm unsure about
is that I was thinking the *calling* SDK would need to insert the transform
to convert to/from Rows (unless it's an SDK that uses the portable
SchemaCoder everywhere and doesn't need a conversion). For example, python
might
Spoke to Brian about his proposal. It is essentially this:
We create PortableSchemaCoder, with a well-known URN. This coder is
parameterized by the schema (i.e. list of field name -> field type pairs).
Java also continues to have its own CustomSchemaCoder. This is
parameterized by the schema as
On Thu, Jun 13, 2019 at 5:47 AM Reuven Lax wrote:
>
> On Wed, Jun 12, 2019 at 8:29 PM Kenneth Knowles wrote:
>
>> Can we choose a first step? I feel there's consensus around:
>>
>> - the basic idea of what a schema looks like, ignoring logical types or
>> SDK-specific bits
>> - the version of
Can we choose a first step? I feel there's consensus around:
- the basic idea of what a schema looks like, ignoring logical types or
SDK-specific bits
- the version of logical type which is a standardized URN+payload plus a
representation
Perhaps we could commit this and see what it looks like
On Wed, Jun 12, 2019 at 3:46 PM Robert Bradshaw wrote:
> On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles wrote:
>
>>
>> I believe the schema registry is a transient construction-time concept. I
>> don't think there's any need for a concept of a registry in the portable
>> representation.
>>
>>
If we go with Reuven's (2) then a logical type
like urn:beam:logical:javasdk is not constraining at all--any SDK/runner
that does not understand this can simply act on its representation (and if
it does not understand that, look at its representation, all the way back
to primitives). However, I
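
The fallback described above (an SDK that does not recognize a logical type acts on its representation, recursively, down to primitives) could be sketched roughly as follows. This is only an illustration: the `LogicalType` class, the `resolve` helper, the primitive names, and the `urn:example:...` URN are hypothetical and not taken from any proposed proto.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical primitive type names, chosen only for this sketch.
PRIMITIVES = {"BYTES", "INT64", "STRING"}


@dataclass(frozen=True)
class LogicalType:
    urn: str
    # Either another logical type or the name of a primitive.
    representation: Union["LogicalType", str]


def resolve(field_type, understood_urns):
    """Return the first type in the representation chain that is understood."""
    current = field_type
    while isinstance(current, LogicalType):
        if current.urn in understood_urns:
            return current
        # Not understood: fall back to the underlying representation.
        current = current.representation
    if current not in PRIMITIVES:
        raise ValueError(f"unknown primitive: {current}")
    return current


# An SDK-specific type layered on a (hypothetical) fixed-bytes logical type.
fixed_bytes = LogicalType("urn:example:logical:fixed_bytes", "BYTES")
java_type = LogicalType("urn:beam:logical:javasdk", fixed_bytes)
```

An SDK that understands neither URN ends up handling plain `BYTES`, while one that knows the fixed-bytes URN stops one level higher.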
On Tue, Jun 11, 2019 at 8:04 PM Kenneth Knowles wrote:
>
> I believe the schema registry is a transient construction-time concept. I
> don't think there's any need for a concept of a registry in the portable
> representation.
>
> I'd rather urn:beam:schema:logicaltype:javasdk not be used
On Wed, Jun 12, 2019 at 2:01 PM Reuven Lax wrote:
> Two thoughts here:
>
> 1. I don't think we should worry about the to/from functions much here.
> From the "portable" perspective, I think the schema should be all that's
> necessary. A given SDK - say the Java SDK - might want to present a
Two thoughts here:
1. I don't think we should worry about the to/from functions much here.
From the "portable" perspective, I think the schema should be all that's
necessary. A given SDK - say the Java SDK - might want to present a nicer
programming interface by allowing users to use the types
Snipping because the context is getting out of hand.
On Mon, Jun 10, 2019 at 3:42 PM Robert Bradshaw wrote:
> On Mon, Jun 10, 2019 at 11:53 PM Kenneth Knowles wrote:
>
>> Most things you would do directly to a representation without knowing
>> what it represents are going to be nonsense. But
On Mon, Jun 10, 2019 at 11:53 PM Kenneth Knowles wrote:
> Good points. At a high level it doesn't sound like anything is blocking,
> right?
>
It doesn't sound like we've settled on an actual proto definition yet,
which may be influenced by the questions below.
> On Mon, Jun 10, 2019 at 2:14
Good points. At a high level it doesn't sound like anything is blocking,
right?
On Mon, Jun 10, 2019 at 2:14 AM Robert Bradshaw wrote:
> On Sat, Jun 8, 2019 at 9:25 PM Kenneth Knowles wrote:
>
>> On Fri, Jun 7, 2019 at 4:35 AM Robert Burke wrote:
>>
>>> Wouldn't SDK specific types always be
On Sat, Jun 8, 2019 at 9:25 PM Kenneth Knowles wrote:
> On Fri, Jun 7, 2019 at 4:35 AM Robert Burke wrote:
>
>> Wouldn't SDK specific types always be under the "coders" component
>> instead of the logical type listing?
>>
>> Offhand, having a separate normalized listing of logical schema types
The topic of schema registries probably does not block the design and
implementation of logical types and portable schemas by themselves, however
I think we should spend some time discussing it (probably in a separate
thread) so that all SDKs have similar mechanisms for schema registration
and
Wouldn't SDK specific types always be under the "coders" component instead
of the logical type listing?
Offhand, having a separate normalized listing of logical schema types in
the pipeline components message of the types seems about right. Then
they're unambiguous, but can also either refer to
If we want to have a Pipeline level registry, we could add it to Components
[1].
message Components {
...
map logical_types;
}
And in FieldType reference the logical types by id:
oneof field_type {
AtomicType atomic_type;
ArrayType array_type;
...
string logical_type_id;// was
Yeah that's what I meant. It does seem reasonable to scope any
registry by pipeline and not by PCollection. Then it seems we would want
the entire LogicalType (including the `FieldType representation` field) as
the value type, and not just LogicalTypeConversion. Otherwise we're
separating
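
A rough sketch of the pipeline-scoped registry discussed above, with field types referring to logical types by id and the registry holding the full logical type (including its representation) as the value. The Python classes only mirror the message names in the thread; the field layout, the `representation_of` helper, and the example URN are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FieldType:
    # Exactly one of these is expected to be set (stand-in for a proto oneof).
    atomic_type: Optional[str] = None
    logical_type_id: Optional[str] = None


@dataclass
class LogicalType:
    urn: str
    representation: FieldType


@dataclass
class Components:
    # Pipeline-level registry: id -> full LogicalType.
    logical_types: Dict[str, LogicalType] = field(default_factory=dict)


def representation_of(components, field_type):
    """Follow logical_type_id references until a non-logical type is reached."""
    seen = set()
    while field_type.logical_type_id is not None:
        type_id = field_type.logical_type_id
        if type_id in seen:
            raise ValueError(f"cyclic logical type reference: {type_id}")
        seen.add(type_id)
        field_type = components.logical_types[type_id].representation
    return field_type


components = Components()
components.logical_types["lt0"] = LogicalType(
    "urn:example:logical:timestamp", FieldType(atomic_type="INT64"))
components.logical_types["lt1"] = LogicalType(
    "urn:example:logical:wrapped", FieldType(logical_type_id="lt0"))
```

Keeping the representation inside the registry value means a reader needs only the id and the components map to find out how a field is physically encoded.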
On Tue, Jun 4, 2019 at 9:20 AM Brian Hulette wrote:
>
>
> On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax wrote:
>
>>
>>
>> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette
>> wrote:
>>
>>> > It has to go into the proto somewhere (since that's the only way the
>>> SDK can get it), but I'm not sure
On Mon, Jun 3, 2019 at 10:04 PM Reuven Lax wrote:
>
>
> On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette wrote:
>
>> > It has to go into the proto somewhere (since that's the only way the
>> SDK can get it), but I'm not sure they should be considered integral parts
>> of the type.
>> Are you just
On Mon, Jun 3, 2019 at 12:27 PM Brian Hulette wrote:
> > It has to go into the proto somewhere (since that's the only way the
> SDK can get it), but I'm not sure they should be considered integral parts
> of the type.
> Are you just advocating for an approach where any SDK-specific information
>
> It has to go into the proto somewhere (since that's the only way the SDK
can get it), but I'm not sure they should be considered integral parts of
the type.
Are you just advocating for an approach where any SDK-specific information
is stored outside of the Schema message itself so that Schema
On Mon, Jun 3, 2019 at 10:53 AM Reuven Lax wrote:
> So I feel a bit leery about making the to/from functions a fundamental
> part of the portability representation. In my mind, that is very tied to a
> specific SDK/language. An SDK (say the Java SDK) wants to allow users to use
> a wide variety
So I feel a bit leery about making the to/from functions a fundamental part
of the portability representation. In my mind, that is very tied to a
specific SDK/language. An SDK (say the Java SDK) wants to allow users to use
a wide variety of native types with schemas, and under the covers uses the
Ah I see, I didn't realize that. Then I suppose we'll need to/from
functions somewhere in the logical type conversion to preserve the current
behavior.
I'm still a little hesitant to make these functions an explicit part of
LogicalTypeConversion for another reason. Down the road, schemas could
Keep in mind that right now the SchemaRegistry is only assumed to exist at
graph-construction time, not at execution time; all information in the
schema registry is embedded in the SchemaCoder, which is the only thing we
keep around when the pipeline is actually running. We could look into
> Can you propose what the protos would look like in this case? Right now
LogicalType does not contain the to/from conversion functions in the proto.
Do you think we'll need to add these in?
Maybe. Right now the proposed LogicalType message is pretty simple/generic:
message LogicalType {
I like the concept of expressing type coercion as a wrapper coder which
says that this language treats this type as Foo. This seems to be useful in
general for cross language pipelines since it is much more likely that two
languages will understand an encoding but may want to express the type
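
The wrapper-coder idea above could look something like the following sketch: a coder that delegates the byte encoding to a base coder both languages understand, and applies language-specific to/from functions around it. `Int64Coder`, `CoercionCoder`, and the choice of `timedelta` as the language-level type are all hypothetical; this is not the Beam coder API.

```python
import struct
from dataclasses import dataclass
from datetime import timedelta
from typing import Any, Callable


class Int64Coder:
    """Stand-in base coder: big-endian signed 64-bit integers."""

    def encode(self, value: int) -> bytes:
        return struct.pack(">q", value)

    def decode(self, data: bytes) -> int:
        return struct.unpack(">q", data)[0]


@dataclass
class CoercionCoder:
    """Wraps a base coder; this language treats the decoded value as another type."""
    base: Any
    to_base: Callable[[Any], Any]
    from_base: Callable[[Any], Any]

    def encode(self, value) -> bytes:
        return self.base.encode(self.to_base(value))

    def decode(self, data: bytes):
        return self.from_base(self.base.decode(data))


# This language sees a timedelta; on the wire it is an int64 of microseconds.
duration_coder = CoercionCoder(
    base=Int64Coder(),
    to_base=lambda td: td // timedelta(microseconds=1),
    from_base=lambda micros: timedelta(microseconds=micros),
)
```

Another language could decode the same bytes with the base coder alone and simply treat the value as an integer.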
On Sun, May 26, 2019 at 1:25 PM Reuven Lax wrote:
>
>
> On Fri, May 24, 2019 at 11:42 AM Brian Hulette
> wrote:
>
>> *tl;dr:* SchemaCoder represents a logical type with a base type of Row
>> and we should think about that.
>>
>> I'm a little concerned that the current proposals for a portable
On Fri, May 24, 2019 at 11:42 AM Brian Hulette wrote:
> *tl;dr:* SchemaCoder represents a logical type with a base type of Row
> and we should think about that.
>
> I'm a little concerned that the current proposals for a portable
> representation don't actually fully represent Schemas. It seems
Your reasoning about SchemaCoder really being a type coercion coder makes a
lot of sense to me.
On Fri, May 24, 2019 at 11:42 AM Brian Hulette wrote:
> *tl;dr:* SchemaCoder represents a logical type with a base type of Row
> and we should think about that.
>
> I'm a little concerned that the
*tl;dr:* SchemaCoder represents a logical type with a base type of Row and
we should think about that.
I'm a little concerned that the current proposals for a portable
representation don't actually fully represent Schemas. It seems to me that
the current java-only Schemas are made up of three
Ah thanks! I added some language there.
*From: *Kenneth Knowles
*Date: *Thu, May 9, 2019 at 5:31 PM
*To: *dev
> *From: *Brian Hulette
> *Date: *Thu, May 9, 2019 at 2:02 PM
> *To: *
>
> We briefly discussed using arrow schemas in place of beam schemas entirely
>> in an arrow thread [1]. The
*From: *Brian Hulette
*Date: *Thu, May 9, 2019 at 2:02 PM
*To: *
We briefly discussed using arrow schemas in place of beam schemas entirely
> in an arrow thread [1]. The biggest reason not to do this was that we wanted
> to have a type for large iterables in beam schemas. But given that large
>
We briefly discussed using arrow schemas in place of beam schemas entirely
in an arrow thread [1]. The biggest reason not to do this was that we wanted
to have a type for large iterables in beam schemas. But given that large
iterables aren't currently implemented, beam schemas look very similar to
From: Reuven Lax
Date: Thu, May 9, 2019 at 7:29 PM
To: dev
> Also in the future we might be able to do optimizations at the runner level
> if at the portability layer we understood schemas instead of just raw coders.
> This could be things like only parsing a subset of a row (if we know only a
Also in the future we might be able to do optimizations at the runner level
if at the portability layer we understood schemas instead of just raw
coders. This could be things like only parsing a subset of a row (if we
know only a few fields are accessed) or using a columnar data structure
like
From: Kenneth Knowles
Date: Thu, May 9, 2019 at 5:44 PM
To: dev
>> > *Why multiple int types?* The domains of values for these types are
>> > different. For a language with one "int" or "number" type, that's another
>> > domain of values.
>>
>> What is the value in having different domains? If
On the flip side, Schemas are equivalent to the space of Coders with
the addition of a RowCoder and the ability to materialize to something
other than bytes, right? (Perhaps I'm missing something big here...)
This may make a backwards-compatible transition easier. (SDK-side, the
ability to reason
*From: *Robert Bradshaw
*Date: *Thu, May 9, 2019 at 7:48 AM
*To: *dev
From: Kenneth Knowles
> Date: Thu, May 9, 2019 at 10:05 AM
> To: dev
>
> > This is a huge development. Top posting because I can be more compact.
> >
> > I really think after the initial idea converges this needs a design doc
OK, fair. This parallels how timestamps are implemented in protobuf. Then
it's important (and I'll join the design doc) that we have a list of
standard logical types.
_/
_/ Alex Van Boxel
On Thu, May 9, 2019 at 4:11 PM Reuven Lax wrote:
>
>
> On Thu, May 9, 2019 at 6:34 AM Alex Van Boxel
FYI I can imagine a world in which we have no coders. We could define the
entire model on top of schemas. Today's "Coder" is completely equivalent to
a single-field schema with a logical-type field (actually the latter is
slightly more expressive as you aren't forced to serialize into bytes).
Due
From: Kenneth Knowles
Date: Thu, May 9, 2019 at 10:05 AM
To: dev
> This is a huge development. Top posting because I can be more compact.
>
> I really think after the initial idea converges this needs a design doc with
> goals and alternatives. It is an extraordinarily consequential model
On Thu, May 9, 2019 at 6:34 AM Alex Van Boxel wrote:
> My biggest concern is that if we don't make TIMESTAMP (yes, TIMESTAMP is a
> better name for DATETIME) a first-class citizen, we will get
> *inconsistencies* between the different portability implementations. The
> same holds true for
My biggest concern is that if we don't make TIMESTAMP (yes, TIMESTAMP is a
better name for DATETIME) a first-class citizen, we will get
*inconsistencies* between the different portability implementations. The
same holds true for DECIMAL and DURATION. If we aren't giving pipeline
developers a
This is a huge development. Top posting because I can be more compact.
I really think after the initial idea converges this needs a design doc
with goals and alternatives. It is an extraordinarily consequential model
change. So in the spirit of doing the work / bias towards action, I created
a
From a Python type hints perspective, how do schemas fit? Type hints are
currently used to determine which coder to use.
It seems that given a schema field, it would be useful to be able to
convert it to a coder (using URNs?), and to convert the coder into a typing
type.
This would allow for
Are you suggesting that schemas become an explicit field on PCollection or
that the coder on PCollections has a well known schema coder type that has
a payload that has field names, ids, type, ...?
I'm much more for the latter since it allows for versioning schema
representations over time without
On Wed, May 8, 2019 at 1:23 PM Robert Bradshaw wrote:
> Very excited to see this. In particular, I think this will be very
> useful for cross-language pipelines (not just SQL, but also for
> describing non-trivial data, e.g. for source and sink reuse).
>
> The proto specification makes sense to
Very excited to see this. In particular, I think this will be very
useful for cross-language pipelines (not just SQL, but also for
describing non-trivial data, e.g. for source and sink reuse).
The proto specification makes sense to me. The only thing that looks
like it's missing (other than
On Wed, May 8, 2019 at 10:57 AM Rui Wang wrote:
> Regarding DATETIME, I totally agree it should be removed as a
> primitive type so that each language does not have to find its own time
> library (and if one cannot be found, it will likely fall back to a logical
> type and use int64 from Schema).
>
>
Regarding DATETIME, I totally agree it should be removed as a
primitive type so that each language does not have to find its own time
library (and if one cannot be found, it will likely fall back to a logical
type and use int64 from Schema).
I have two questions regarding the representation:
1.
Beam Java's support for schemas is just about done: we infer schemas from a
variety of types, we have a variety of utility transforms (join, aggregate,
etc.) for schemas, and schemas are integrated with the ParDo machinery. The
big remaining task I'm working on is writing documentation and