Re: [DISCUSS] Structuring Java based DSLs

Kenneth Knowles Mon, 03 Dec 2018 10:28:24 -0800

To be honest, I don't think there's much worth doing right now. I think
more self-contained is better for Beam SQL, generally. Two things I have on
my mind are (1) SQL as an inline transform in every SDK and (2) supporting
pure SQL like the CLI and JDBC driver, where the underlying language is an
implementation detail.


Big picture / long term, I would envision pure SQL, embedded SQL transform,
and a DataFrame-like API in ~each SDK all desugaring to relational algebra
nodes, sharing an optimizer, sharing some amount of mapping the physical
plan to Beam transforms. The necessarily SDK-specific parts are the
embedded transform API and UDFs in the host language. The rest should
remain an implementation detail that we can change.
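
A minimal sketch of the desugaring idea, in plain Java with no Beam or Calcite dependencies. Every name here (RelNode, Scan, Filter, Project, Frame, parseSql) is hypothetical and only illustrates two front ends producing the same relational algebra tree; it is not how Beam SQL is implemented, and the toy parser accepts only one query shape.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RelationalSketch {
    /** Hypothetical relational algebra node; not Beam's or Calcite's API. */
    interface RelNode {
        String describe();
    }

    static final class Scan implements RelNode {
        final String table;
        Scan(String table) { this.table = table; }
        public String describe() { return "Scan(" + table + ")"; }
    }

    static final class Filter implements RelNode {
        final RelNode input;
        final String predicate;
        Filter(RelNode input, String predicate) {
            this.input = input;
            this.predicate = predicate;
        }
        public String describe() {
            return "Filter(" + predicate + ", " + input.describe() + ")";
        }
    }

    static final class Project implements RelNode {
        final RelNode input;
        final List<String> columns;
        Project(RelNode input, List<String> columns) {
            this.input = input;
            this.columns = columns;
        }
        public String describe() {
            return "Project(" + columns + ", " + input.describe() + ")";
        }
    }

    /** DataFrame-like front end: each call wraps the current plan in a new node. */
    static final class Frame {
        final RelNode plan;
        Frame(RelNode plan) { this.plan = plan; }
        static Frame table(String name) { return new Frame(new Scan(name)); }
        Frame where(String predicate) { return new Frame(new Filter(plan, predicate)); }
        Frame select(String... columns) {
            return new Frame(new Project(plan, Arrays.asList(columns)));
        }
    }

    // Toy grammar (an assumption): SELECT <cols> FROM <table> WHERE <predicate>
    private static final Pattern QUERY = Pattern.compile(
        "SELECT\\s+(.+?)\\s+FROM\\s+(\\w+)\\s+WHERE\\s+(.+)", Pattern.CASE_INSENSITIVE);

    /** Pure-SQL front end: builds the same node types as Frame does. */
    static RelNode parseSql(String sql) {
        Matcher m = QUERY.matcher(sql.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported query: " + sql);
        }
        List<String> columns = Arrays.asList(m.group(1).split("\\s*,\\s*"));
        return new Project(new Filter(new Scan(m.group(2)), m.group(3)), columns);
    }

    public static void main(String[] args) {
        RelNode fromSql = parseSql("SELECT id, name FROM users WHERE age > 21");
        RelNode fromFrame =
            Frame.table("users").where("age > 21").select("id", "name").plan;
        System.out.println(fromSql.describe());
        System.out.println(fromFrame.describe());
    }
}
```

Both front ends end up with the same tree, so an optimizer and a mapping to Beam transforms would only ever need to understand RelNode, which is the part that can stay an implementation detail.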

 - For example, it is easy to imagine a customized columnar element/bundle
encoding and SDK harness that only works for SQL to remove overhead of
being general purpose. It could be written in C/C++/Go if we wanted to
squeeze it for perf. Such things are made harder by having an elaborate
end-user API between SQL and the core Beam model.
 - Conversely, for whatever is chosen to underlie SQL's execution,
stability is paramount. Ideally the simplest and least likely to change
transforms would be the foundation. And I wouldn't want to have to design a
user-friendly API for Euphoria or the join library just to enable a
different join algorithm in SQL.
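
As an illustration of swapping join algorithms behind one stable entry point, here is a rough in-memory sketch in plain Java. The names (JoinStrategySketch, choose, broadcastJoin, shuffleJoin) and the row-count cutoff are all assumptions made for the example; a real planner would use size estimates and costs, and none of this is the Beam join library or Euphoria API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JoinStrategySketch {
    enum Strategy { BROADCAST, SHUFFLE }

    // Hypothetical cutoff; a real planner would use size estimates and costs.
    static final int BROADCAST_MAX_ROWS = 2;

    static Strategy choose(int rightRows) {
        return rightRows <= BROADCAST_MAX_ROWS ? Strategy.BROADCAST : Strategy.SHUFFLE;
    }

    /** The stable entry point: inner equi-join on the entry key, sorted output. */
    static List<String> join(List<Map.Entry<Integer, String>> left,
                             List<Map.Entry<Integer, String>> right) {
        return choose(right.size()) == Strategy.BROADCAST
            ? broadcastJoin(left, right)
            : shuffleJoin(left, right, 4);
    }

    /** Index the whole (small) right side once and stream the left side past it. */
    static List<String> broadcastJoin(List<Map.Entry<Integer, String>> left,
                                      List<Map.Entry<Integer, String>> right) {
        List<String> out = new ArrayList<>();
        hashJoin(left, right, out);
        Collections.sort(out);
        return out;
    }

    /** Route both sides to partitions by key, then join each partition locally. */
    static List<String> shuffleJoin(List<Map.Entry<Integer, String>> left,
                                    List<Map.Entry<Integer, String>> right,
                                    int partitions) {
        List<String> out = new ArrayList<>();
        for (int p = 0; p < partitions; p++) {
            hashJoin(partition(left, p, partitions), partition(right, p, partitions), out);
        }
        Collections.sort(out);
        return out;
    }

    private static List<Map.Entry<Integer, String>> partition(
            List<Map.Entry<Integer, String>> rows, int p, int partitions) {
        List<Map.Entry<Integer, String>> part = new ArrayList<>();
        for (Map.Entry<Integer, String> row : rows) {
            if (Math.floorMod(row.getKey(), partitions) == p) {
                part.add(row);
            }
        }
        return part;
    }

    private static void hashJoin(List<Map.Entry<Integer, String>> left,
                                 List<Map.Entry<Integer, String>> right,
                                 List<String> out) {
        Map<Integer, List<String>> index = new HashMap<>();
        for (Map.Entry<Integer, String> r : right) {
            index.computeIfAbsent(r.getKey(), k -> new ArrayList<>()).add(r.getValue());
        }
        for (Map.Entry<Integer, String> l : left) {
            for (String rv : index.getOrDefault(l.getKey(), Collections.emptyList())) {
                out.add(l.getKey() + ":" + l.getValue() + "," + rv);
            }
        }
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> left =
            List.of(Map.entry(1, "a"), Map.entry(2, "b"), Map.entry(3, "c"));
        List<Map.Entry<Integer, String>> right =
            List.of(Map.entry(1, "p"), Map.entry(2, "x"), Map.entry(3, "r"), Map.entry(5, "s"));
        System.out.println(join(left, right));
    }
}
```

Callers only see join(); which algorithm runs is decided inside, so it can change without any user-facing API having to be designed around it.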

So my take is: keep SQL flexible, implement SQL on low-level, stable
APIs, and use the join library, Euphoria, etc., if it looks like a big win,
but don't build any policy here or do big refactors right now.

Kenn

On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <[email protected]> wrote:

> Hi Robert,
>
> currently there is no actual proposal, I was just trying to gather
> feedback from the community. But my original thoughts would be [1]. I
> actually don't see much need for restructuring the code by nesting
> directories. If the community sees that it would make sense to structure
> the dependencies, the second step would probably be to figure out how to
> accomplish this. I don't have any exact solution in mind so far; we
> would probably first need to identify features that are needed by
> SQL and not currently supported by Euphoria. Then we can actually
> identify the costs and see if this still makes sense.
>
>   Jan
>
> On 12/3/18 6:17 PM, Robert Bradshaw wrote:
> > Taking a step back, what exactly is the proposal? Looking at the
> > original message, I see
> >
> > (1) Letting SQL take a dependency on Euphoria, sharing more code and
> > taking advantage of the logical nesting of levels of abstraction. This
> > makes sense to me.
> > (2) Nesting the directories (but not the gradle targets or module
> > names?). Here I'm not so sure about the benefit, especially vs. the
> > cost.
> > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <[email protected]> wrote:
> >> I think that the fact that SQL uses some other internal dependency
> >> should remain a hidden implementation detail. I absolutely agree that the
> >> dependency should of course remain sdks-java-sql in all cases.
> >>
> >>     Jan
> >>
> >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
> >>> I suppose what I'm trying to say is that I see this module structure
> >>> as a tool for discoverability and enumerating end-user endpoints. In
> >>> other words, if one wants to use SQL, it would seem odd to have to
> >>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
> >>> sdks-java-euphoria is also a DSL one might use. A sibling relationship
> >>> does not prohibit the layered approach to implementation that sounds
> >>> like it makes sense.
> >>>
> >>> (As for merging Euphoria into core, my initial impression is that's
> >>> probably a good idea, and something we should consider for 3.0 at the
> >>> very least.)
> >>>
> >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <[email protected]> wrote:
> >>>> Hi Rui,
> >>>>
> >>>> yes, there are optimizations that could be added by each layer. The
> purpose of the Euphoria layer is actually not to reorder or modify any user
> operators that are present in the pipeline (because it might not have
> enough information to do this), but it can, for instance, choose between
> various join implementations (shuffle join, broadcast join, ...) - so the
> optimizations it can do are more low level. But this plays nicely with the
> DSL hierarchy - each layer adds a few more restrictions, but can
> therefore do more optimizations. And I think that the layer between SDK and
> SQL wouldn't have to support SQL optimizations; it would only have to
> support a way for SQL to express these optimizations.
> >>>>
> >>>>     Jan
> >>>>
> >>>> ---------- Original e-mail ----------
> >>>> From: Rui Wang <[email protected]>
> >>>> To: [email protected]
> >>>> Date: 30. 11. 2018 22:43:04
> >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
> >>>>
> >>>> SQL's optimization is another area to consider for integration. SQL
> optimization includes pushing down filters/projections, merging, removing,
> or swapping plan nodes, and comparing plan costs to choose the best plan.
> Adding another layer between SQL and the Java core might require that layer
> to support SQL optimizations, if there is a need.
> >>>>
> >>>> I don't have a clear picture of what SQL needs from Euphoria for
> optimization (best case is nothing). As those optimizations are happening or
> will happen, we might start to get a sense of it.
> >>>>
> >>>> -Rui
> >>>>
> >>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <[email protected]>
> wrote:
> >>>>
> >>>> I don't really see Euphoria as a subset of SQL or the other way
> >>>> around, and I think it makes sense to use either without the other, so
> >>>> by this criterion they are better kept as siblings than nested.
> >>>>
> >>>> That said, I think it's really good to have a bunch of shared code,
> >>>> e.g. a join library that could be used by both. One could even depend
> >>>> on the other without having to abandon the sibling relationship.
> >>>> Something like retractions belongs in the core SDK itself. Deeper than
> >>>> that, actually, it should be part of the model.
> >>>>
> >>>> - Robert
> >>>>
> >>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <[email protected]>
> wrote:
> >>>>> Jan, we made Kryo optional recently (it is a separate module and is
> used only in tests). From a quick look it seems that we forgot to remove the
> compile-time dependency from euphoria's build.gradle. The only "strong"
> dependencies I'm aware of are the core SDK and Guava. We'll probably be
> adding a dependency on the sketching extension soon.
> >>>>>
> >>>>> D.
> >>>>>
> >>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <[email protected]>
> wrote:
> >>>>>> Hi Anton,
> >>>>>> reactions inline.
> >>>>>>
> >>>>>> ---------- Original e-mail ----------
> >>>>>> From: Anton Kedin <[email protected]>
> >>>>>> To: [email protected]
> >>>>>> Date: 30. 11. 2018 18:17:06
> >>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
> >>>>>>
> >>>>>> I think this approach makes sense in general; Euphoria can be an
> implementation detail of SQL, similar to Join Library or core SDK Schemas.
> >>>>>>
> >>>>>> I wonder though whether it would be better to bring Euphoria closer
> to core SDK first, maybe even merge them together. If you look at Reuven's
> recent work around schemas it seems like there are already similarities
> between that and Euphoria's approach, unless I'm missing the point (e.g.
> Filter transforms, FullJoin vs CoGroup... see [2]). And we're already
> switching parts of SQL to those transforms (e.g. SQL Aggregation is now
> implemented by core SDK's Group[3]).
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Yes, these transforms seem to be very similar to those Euphoria
> has. Whether or not to merge Euphoria with core is essentially just a
> decision for the community (from my point of view).
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Adding explicit Schema support to Euphoria will bring it both
> closer to core SDK and make it natural to use for SQL. Can this be a first
> step towards this integration?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> Euphoria currently operates on pure PCollections, so when a
> PCollection has a schema, it will be accessible by Euphoria. It makes sense
> to make use of the schema in Euphoria - it seems natural on inputs to
> Euphoria operators, but it might be tricky (not saying impossible) to
> actually produce schema-aware PCollections as outputs from Euphoria
> operators (generally speaking; in special cases that might be possible).
> Regarding inputs, there is actually an intention to act on the type of the
> PCollection - e.g. when a PCollection is already of type KV, it is possible
> to make the key extractor and value extractor optional in Euphoria builders,
> so it feels natural to let the builders adapt when given a schema-aware
> PCollection and make use of the provided schema. The rest of the Euphoria
> team might correct me if I'm wrong.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> One question I have is, does Euphoria bring dependencies that are
> not needed by SQL, or does it rely more or less only on the core SDK?
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> I think the only relevant dependency that Euphoria has besides core
> SDK is Kryo. It is the default coder when no coder is provided, but that
> could be made optional - e.g. the default coder would be supported only if
> an appropriate module were available. That way, I think Euphoria would
> have no special dependencies.
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> [1]
> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
> >>>>>> [2]
> https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
> >>>>>> [3]
> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <[email protected]>
> wrote:
> >>>>>>
> >>>>>> Hi community,
> >>>>>>
> >>>>>> I'm part of the Euphoria DSL team, and on behalf of this team, I'd like
> to
> >>>>>> discuss possible development of the Java based DSLs currently present in
> >>>>>> Beam. To my knowledge, there are currently two DSLs based on the Java
> SDK -
> >>>>>> Euphoria and SQL. These DSLs currently share only the SDK itself,
> >>>>>> although there might be room to share some more effort. We already
> know
> >>>>>> that both Euphoria and SQL have need for retractions, but there are
> >>>>>> probably many more features that these two could share.
> >>>>>>
> >>>>>> So, I'd like to open a discussion on what it would cost and what it
> >>>>>> would possibly bring, if instead of the current structure
> >>>>>>
> >>>>>>      Java SDK
> >>>>>>
> >>>>>>        | ---- SQL
> >>>>>>
> >>>>>>        | ---- Euphoria
> >>>>>>
> >>>>>> these DSLs would be structured as
> >>>>>>
> >>>>>>      Java SDK ---> Euphoria ---> SQL
> >>>>>>
> >>>>>> I'm absolutely sure that this would be a great investment and a huge
> >>>>>> change, but I'd like to gather some opinions and general feelings
> of the
> >>>>>> community about this. Some points to start the discussion from my
> side
> >>>>>> would be, that structuring DSLs like this has internal logical
> >>>>>> consistency, because each API layer further narrows completeness,
> but
> >>>>>> brings simpler API for simpler tasks, while adding additional
> high-level
> >>>>>> view of the data processing pipeline and thus enabling more
> >>>>>> optimizations. On the Euphoria side, these are various implementations
> of joins
> >>>>>> (most effective implementation depends on data), pipeline sampling
> and
> >>>>>> more. Some (or maybe most) of these optimizations would have to be
> >>>>>> implemented in both DSLs, so implementing them once is beneficial.
> >>>>>> Another benefit is that this would bring Euphoria "closer" to Beam
> core
> >>>>>> development (which would be good, it is part of the project anyway,
> >>>>>> right? :)) and help better drive features, that although currently
> >>>>>> needed mostly by SQL, might be needed by other Java users anyway.
> >>>>>>
> >>>>>> Thanks for discussion and looking forward to any opinions.
> >>>>>>
> >>>>>>      Jan
> >>>>>>
>
