Re: [DISCUSS] Portability representation of schemas

Robert Bradshaw Thu, 09 May 2019 10:23:02 -0700

From: Kenneth Knowles <[email protected]>
Date: Thu, May 9, 2019 at 5:44 PM
To: dev


>> > *Why multiple int types?* The domains of values for these types are 
>> > different. For a language with one "int" or "number" type, that's another 
>> > domain of values.
>>
>> What is the value in having different domains? If your data has a
>> natural domain, chances are it doesn't line up exactly with one of
>> these. I guess it's for languages whose types have specific domains?
>> (There's also compactness in representation, encoded and in-memory,
>> though I'm not sure that's high.)
>
> Are you asking why have int16, int32, int64 as opposed to a single domain of 
> "integers"? Most languages have some of these types so it is a pretty natural 
> fit. They also can have a fixed width encoding; I'm not an expert in whether 
> that becomes important for columnar batches.

Languages having these types is a good argument. As for importance
for columnar operations, it's just the memory size advantages (e.g.
fitting more values into a CPU cache and keeping more of them there).
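
To make the size argument concrete, here is a minimal sketch using
Python's standard struct module. The helper names (domain, encoded_size,
column_bytes) are hypothetical and only illustrate fixed-width domains and
encodings; they are not part of any Beam API.

```python
import struct

# struct's standard-size format codes for signed fixed-width integers.
FIXED_WIDTH_CODES = {"int16": "h", "int32": "i", "int64": "q"}


def domain(bits):
    """Inclusive (min, max) range of a signed two's-complement integer."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def encoded_size(type_name):
    """Bytes per value in a fixed-width little-endian encoding."""
    return struct.calcsize("<" + FIXED_WIDTH_CODES[type_name])


def column_bytes(type_name, values):
    """Pack a column of values contiguously, as a columnar batch might."""
    fmt = "<%d%s" % (len(values), FIXED_WIDTH_CODES[type_name])
    return struct.pack(fmt, *values)
```

Under these assumptions, a column of 1000 values takes 2000 bytes as int16
and 8000 bytes as int64, so four times as many int16 values fit in the
same amount of cache.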

>> > *Columnar/Arrow*: making sure we unlock the ability to take this path is 
>> > paramount. So tying it directly to a row-oriented coder seems 
>> > counterproductive.
>>
>> I don't think Coders are necessarily row-oriented. They are, however,
>> bytes-oriented. (Perhaps they need not be.) There seems to be a lot of
>> overlap between what Coders express in terms of element typing
>> information and what Schemas express, and I'd rather have one concept
>> if possible. Or have a clear division of responsibilities.
>
> A coder is more-or-less a function from element -> bytes. Do you have a 
> different idea? Like using coders just as a type declaration and having the 
> SDK/runner have a second interface that it interacts with?

Coders are (currently) the objects we use to represent and reason
about types, as well as to serialize elements. Schemas are moving into
this space as well.
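
As a rough illustration of the two roles, here is a minimal sketch of a
coder that both names a type and serializes its elements. Int64Coder,
PairCoder, type_name, and size are hypothetical names for this example,
not the Beam Coder API, and the composite only handles fixed-width
components.

```python
import struct


class Int64Coder:
    """Hypothetical coder for signed 64-bit integers (big-endian, 8 bytes)."""

    type_name = "int64"  # role 1: a declaration that can be reasoned about
    size = 8             # fixed encoded width in bytes

    def encode(self, value):  # role 2: element -> bytes
        return struct.pack(">q", value)

    def decode(self, data):
        return struct.unpack(">q", data)[0]


class PairCoder:
    """Hypothetical composite coder; only handles fixed-width components."""

    def __init__(self, first, second):
        self.first, self.second = first, second
        self.size = first.size + second.size

    @property
    def type_name(self):
        return "pair<%s, %s>" % (self.first.type_name, self.second.type_name)

    def encode(self, value):
        a, b = value
        return self.first.encode(a) + self.second.encode(b)

    def decode(self, data):
        split = self.first.size
        return self.first.decode(data[:split]), self.second.decode(data[split:])
```

The type_name side is the part that overlaps with what a schema expresses;
the encode/decode side is the bytes-oriented part.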

>> > *Multimap*: what does it add over an array-valued map or 
>> > large-iterable-valued map? (honest question, not rhetorical)
>>
>> Multimap has a different notion of what it means to contain a value,
>> can handle (unordered) unions of non-disjoint keys, etc. Maybe this
>> isn't worth a new primitive type.
>
> I guess it might come down to whether MultiMap<k, v> ::= Map<k, Iterable<v>> 
> as a logical type is efficient or merits a different encoding. No strong 
> opinion.

Yeah, ties into the meaning of logical types. Using the same encoding
is probably just fine.
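
A minimal sketch of that idea, assuming a hypothetical MultiMap class
whose encoded representation is simply a map from key to a list of values.
Membership and union follow multimap semantics, while to_representation
and from_representation reuse the plain map shape.

```python
from collections import defaultdict


class MultiMap:
    """Hypothetical multimap stored as a map from key to a list of values."""

    def __init__(self):
        self._data = defaultdict(list)

    def add(self, key, value):
        self._data[key].append(value)

    def __contains__(self, item):
        # Containment is about (key, value) pairs, not whole value lists.
        key, value = item
        return value in self._data.get(key, ())

    def union(self, other):
        # Keys need not be disjoint; values for a shared key are combined.
        result = MultiMap()
        for source in (self, other):
            for key, values in source._data.items():
                for value in values:
                    result.add(key, value)
        return result

    def to_representation(self):
        # Same shape as Map<k, Iterable<v>>, so the same encoding applies.
        return {key: list(values) for key, values in self._data.items()}

    @classmethod
    def from_representation(cls, representation):
        result = cls()
        for key, values in representation.items():
            for value in values:
                result.add(key, value)
        return result
```

With this shape the only difference from an array-valued map is in the
operations, which is what a logical type layered on an existing encoding
would provide.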

>> > *URN/enum for type names*: I see the case for both. The core types are 
>> > fundamental enough they should never really change - after all, proto, 
>> > thrift, avro, arrow, have addressed this (not to mention most programming 
>> > languages). Maybe additions once every few years. I prefer the smallest 
>> > intersection of these schema languages. A oneof is more clear, while URN 
>> > emphasizes the similarity of built-in and logical types.
>>
>> Hmm... Do we have any examples of the multi-level primitive/logical
>> type in any of these other systems?
>
> Yes, I'd say it is the rule not the exception: 
> https://github.com/protocolbuffers/protobuf/blob/d9ccd0c0e6bbda9bf4476088eeb46b02d7dcd327/java/compatibility_tests/v2.5.0/more_protos/src/proto/google/protobuf/descriptor.proto#L104

This doesn't have an open-ended type system (or the notion of logical types).

>> I have a bias towards all types
>> being on the same footing unless there is a compelling reason to divide
>> things into primitive/user-defined ones.
>
> To be clear, my understanding here is that this is an AST representation 
> question, not an expressivity or user-facing API question. I don't think URNs 
> vs oneof affects the universe of schemas, how their values are embedded in 
> specific languages, and how they are encoded. Today the difference is 
> front-and-center in Java but that is not fundamental and we could come up 
> with an in-Java representation that made all types look equivalent to users. 
> Now, the choice of what goes in the oneof and which URNs to standardize is a 
> different question, and one of the biggest decisions. I just meant to comment 
> on the minor issue.

Agreed. I think the AST should define how we think about the model,
which does influence the API (and consistency across languages,
insofar as it makes sense). Exactly where logical types fit in seems
like the biggest open question here. (I'm curious about the history;
did schemas originally start with an enumeration of allowed types, and
then logical types were added on when this was discovered to be not
enough, and would we have come up with this structure had we wanted to
make it open-ended at the start?)
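
To make the two AST shapes concrete, here is a minimal sketch with Python
dataclasses. All names here (Primitive, LogicalType, UrnType, to_urn_form)
and the "example:" URN strings are hypothetical, chosen only for
illustration; they are not the actual Beam proto definitions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Union


class Primitive(Enum):
    """Closed enumeration of built-in types (the oneof-style branch)."""
    INT16 = "int16"
    INT32 = "int32"
    INT64 = "int64"


@dataclass(frozen=True)
class LogicalType:
    """Open-ended type identified by a URN, layered on a representation."""
    urn: str
    representation: "FieldType"


# Shape A: primitives and logical types are distinct kinds of node.
FieldType = Union[Primitive, LogicalType]


@dataclass(frozen=True)
class UrnType:
    """Shape B: every type, built-in or not, is a URN on the same footing."""
    urn: str
    representation: Optional["UrnType"] = None


def to_urn_form(field_type):
    """Translate shape A into shape B; no expressivity is lost either way."""
    if isinstance(field_type, Primitive):
        return UrnType("example:schema:primitive:" + field_type.value)
    return UrnType(field_type.urn, to_urn_form(field_type.representation))
```

Since the translation is mechanical, the sketch is consistent with the
point that the choice is about representation rather than about the
universe of schemas.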
