Hi Reuven,
I agree that we should be able to describe the schema with different
formats. The good point about JSON schemas is that they are described by
a spec. My point is also to avoid reinventing the wheel. An abstraction
that lets us use Avro, JSON, Calcite, or custom schema descriptors would
be great.
Using a coder to describe a schema sounds like a smart move to implement
quickly. However, it has to be clearly documented to avoid "side
effects". I still think PCollection.setSchema() is better: it should be
metadata (or a hint ;)) on the PCollection.
Regards
JB
On 31/01/2018 20:16, Reuven Lax wrote:
As to the question of how a schema should be specified, I want to
support several common schema formats. So if a user has a JSON schema,
an Avro schema, a Calcite schema, etc., there should be adapters that
allow setting a schema from any of them. I don't think we should prefer
one over the other. While Romain is right that many people know JSON, I
think far fewer people know JSON schemas.
Agree, schemas should not be enforced (for one thing, that wouldn't be
backwards compatible!). I think for the initial prototype I will
probably use a special coder to represent the schema (with setSchema an
option on the coder), largely because it doesn't require modifying
PCollection. However I think longer term a schema should be an optional
piece of metadata on the PCollection object. Similar to the previous
discussion about "hints," I think this can be set on the producing
PTransform, and a SetSchema PTransform will allow attaching a schema to
any PCollection (i.e. pc.apply(SetSchema.of(schema))). This part isn't
designed yet, but I think a schema should be similar to hints: just
another piece of metadata on the PCollection (though interpreted by the
model, whereas hints are interpreted by the runner).
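As a rough pseudocode sketch of the two stages described above
(SchemaCoder and SetSchema are hypothetical names from this discussion,
not existing Beam API):

```
// Initial prototype: carry the schema on a special coder,
// so PCollection itself does not need to change.
PCollection<Row> rows = input.apply(ParDo.of(new ToRowFn()));
rows.setCoder(SchemaCoder.of(schema));

// Longer term: schema as optional metadata on the PCollection,
// attached with a dedicated transform, analogous to hints.
PCollection<Row> withSchema = rows.apply(SetSchema.of(schema));
```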
Reuven
On Tue, Jan 30, 2018 at 1:37 AM, Jean-Baptiste Onofré
<j...@nanthrax.net> wrote:
Hi,
I think we should avoid mixing two things in the discussion (and so in
the document):
1. The elements of the collection and the schema itself are two
different things.
By essence, Beam should not enforce any schema. That's why I think it's
a good idea to set the schema optionally on the PCollection
(pcollection.setSchema()).
2. From point 1 come two questions: how do we represent a schema? How
can we leverage the schema to simplify the serialization of the elements
in the PCollection, and the querying? These two questions are not
directly related.
2.1. How do we represent the schema
JSON Schema is a very interesting idea. It could serve as an abstraction
onto which other providers, like Avro, can be bound, with the JSON
Processing spec (javax.json) providing the supporting API.
2.2. How do we leverage the schema for querying and serialization
Also in the spec, JSON Pointer is interesting for the querying.
Regarding the serialization, Jackson or another data binder can be
used.
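To make the JSON Pointer idea concrete, here is a minimal sketch of the
kind of lookup it standardizes (RFC 6901), written against plain JDK
maps and lists rather than the javax.json API, so it runs without any
extra dependency:

```java
import java.util.List;
import java.util.Map;

public class PointerDemo {

    // Resolve an RFC 6901-style pointer such as "/user/emails/0"
    // against nested Maps (JSON objects) and Lists (JSON arrays).
    static Object resolve(Object doc, String pointer) {
        if (pointer.isEmpty()) {
            return doc; // "" points at the whole document
        }
        Object current = doc;
        for (String token : pointer.substring(1).split("/", -1)) {
            // Unescape per RFC 6901: "~1" -> "/", then "~0" -> "~"
            token = token.replace("~1", "/").replace("~0", "~");
            if (current instanceof Map) {
                current = ((Map<?, ?>) current).get(token);
            } else if (current instanceof List) {
                current = ((List<?>) current).get(Integer.parseInt(token));
            } else {
                throw new IllegalArgumentException("Cannot navigate into " + current);
            }
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, Object> doc = Map.of(
            "user", Map.of(
                "name", "jb",
                "emails", List.of("a@example.org", "b@example.org")));
        System.out.println(resolve(doc, "/user/name"));      // jb
        System.out.println(resolve(doc, "/user/emails/1"));  // b@example.org
    }
}
```

In JSON-P itself this is simply Json.createPointer("/user/name")
applied to a JsonObject; the point is that the query syntax is
specified, so it works the same regardless of the underlying
serialization.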
These are still rough ideas in my mind, but I like Romain's idea about
using JSON-P.
Once the 2.3.0 release is out, I will start to update the document with
those ideas, and a PoC.
Thanks !
Regards
JB
On 01/30/2018 08:42 AM, Romain Manni-Bucau wrote:
>
>
> On 30 Jan 2018 at 01:09, "Reuven Lax" <re...@google.com> wrote:
>
>
>
> On Mon, Jan 29, 2018 at 12:17 PM, Romain Manni-Bucau
> <rmannibu...@gmail.com> wrote:
>
> Hi
>
> I have some questions on this: how would hierarchical schemas work?
> It seems they are not really supported by the ecosystem (outside of
> custom stuff) :(. How would they integrate smoothly with other generic
> record types - N bridges?
>
>
> Do you mean nested schemas? What do you mean here?
>
>
> Yes, sorry - I wrote the mail too late ;). I meant hierarchical data
> and nested schemas.
>
>
> Concretely, I wonder if using a JSON API couldn't be beneficial:
> JSON-P is a nice generic abstraction with a built-in querying
> mechanism (JSON Pointer) but no actual serialization (even if JSON and
> binary JSON are very natural). The big advantage is to have a
> well-known ecosystem - who doesn't know JSON today? - that Beam can
> reuse for free: JsonObject (I guess we don't want the JsonValue
> abstraction) for the record type, the JSON Schema standard for the
> schema, JSON Pointer for the selection/projection, etc. It doesn't
> enforce the actual serialization (JSON, Smile, Avro, ...) but provides
> an expressive and already known API, so I see it as a big win-win for
> users (no need to learn a new API or use N bridges in all directions)
> and for Beam (the implementations are already here and the API design
> has already been thought out).
>
>
> I assume you're talking about the API for setting schemas, not using
> them. JSON has many downsides and I'm not sure it's true that everyone
> knows it; there are also competing schema APIs, such as Avro, etc.
> However, I think we should give JSON a fair evaluation before
> dismissing it.
>
>
> It is a wider topic than schemas. Actually, schemas are not the
> first-class citizen here; a generic data representation is. That is
> where JSON beats almost any other API. Then, when it comes to schemas,
> JSON has a standard for that, so we are all good.
>
> Also, JSON has a good indexing API compared to alternatives, which are
> sometimes a bit faster - for no-op transforms - but are hardly usable
> or make the code less readable.
>
> Avro is a nice competitor, and it is compatible - actually Avro is
> JSON-driven by design - but its API is far from being that easy, due
> to its schema enforcement, which is heavy; worse, you can't work with
> Avro without a schema. JSON would allow us to reconcile the dynamic
> and static cases, since the job wouldn't change except for the
> setSchema call.
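Romain's dynamic/static point can be sketched in pseudocode (ParseJson,
Project, and setSchema are hypothetical names used for illustration,
not existing API):

```
// Dynamic case: no schema, elements are generic JSON-like records
PCollection<JsonObject> records = input.apply(ParseJson.of());
records.apply(Project.of("/user/name"));

// Static case: identical job, with the schema attached on top;
// setting the schema is the only change.
records.setSchema(schema);
records.apply(Project.of("/user/name"));
```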
>
> That is why I think JSON is a good compromise, and having a standard
> API for it allows the implementation to be fully customized at will if
> needed - even using Avro or Protobuf.
>
> Side note on the Beam API: I don't think it is good to use the main
> API for runner optimization. It forces something to be shared across
> all runners while not being widely usable. It is also misleading for
> users. Would you set a Flink pipeline option with Dataflow? My
> proposal here is to use hints - properties - instead of something
> hard-coded in the API, and then standardize it if all runners support
> it.
>
>
>
> Wdyt?
>
> On 29 Jan 2018 at 06:24, "Jean-Baptiste Onofré" <j...@nanthrax.net>
> wrote:
>
> Hi Reuven,
>
> Thanks for the update! As I'm working with you on this, I fully
> agree - great doc gathering the ideas.
>
> It's clearly something we have to add ASAP in Beam, because it would
> allow new use cases for our users (in a simple way) and open new areas
> for the runners (for instance, DataFrame support in the Spark runner).
>
> By the way, a while ago I created BEAM-3437 to track the PoC/PR
> around this.
>
> Thanks !
>
> Regards
> JB
>
> On 01/29/2018 02:08 AM, Reuven Lax wrote:
> > Previously I submitted a proposal for adding schemas as a
> > first-class concept on Beam PCollections. The proposal engendered
> > quite a bit of discussion from the community - more discussion than
> > I've seen from almost any of our proposals to date!
> >
> > Based on the feedback and comments, I reworked the proposal
> > document quite a bit. It now talks more explicitly about the
> > difference between dynamic schemas (where the schema is not fully
> > known at graph-creation time) and static schemas (which are fully
> > known at graph-creation time). The proposed APIs are more fleshed
> > out now (again thanks to feedback from community members), and the
> > document talks in more detail about evolving schemas in long-running
> > streaming pipelines.
> >
> > Please take a look. I think this will be very valuable to Beam,
> > and I welcome any feedback.
> >
> >
> > https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#
> >
> > Reuven
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
>
>
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com