Re: [DISCUSS] Structuring Java based DSLs

Jan Lukavský Tue, 04 Dec 2018 02:05:03 -0800

Hi Kenn,

my intent really was not to propose any changes right now. I'm trying to create a clear understanding of what the relationship between Euphoria and SQL should be in the long run. In my view, Euphoria should always be a superset of SQL, because it should support complete relational algebra (I'm not saying it does so right now, only that this should be our goal), plus more flexible UDFs (not limited to the SQL standard) and stateful processing (which will probably not be part of SQL any time soon). There should be some sort of guarantee that the semantics of SQL and Euphoria are the same, because that is what users would expect. This can certainly be ensured by introducing another layer between Euphoria and the core SDK (e.g. the join library), but the question is: what makes this solution different from creating this shared library from Euphoria itself (when looking at the big picture)? And it is not only about implementations of joins or other operators; there are other techniques that could be beneficial for SQL, e.g. pipeline sampling, automatic pipeline optimizations based on statistics from previous runs of batch queries, etc.
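
To make the shared-layer question concrete, here is a minimal sketch of one decision such a layer could own: picking a physical join implementation from size estimates. Everything in it is an assumption for illustration: JoinPlanner, JoinStrategy, choose and the 64 MiB broadcast limit are hypothetical names and values, not existing Beam, Euphoria or Beam SQL API.

```java
import java.util.OptionalLong;

/** Hypothetical shared helper; not part of Beam, Euphoria, or Beam SQL. */
final class JoinPlanner {

  /** The two physical join implementations named in this thread. */
  public enum JoinStrategy { SHUFFLE, BROADCAST }

  // Assumed threshold: a side estimated at or below this size is broadcast.
  static final long BROADCAST_LIMIT_BYTES = 64L * 1024 * 1024;

  private JoinPlanner() {}

  /**
   * Chooses a join strategy from optional size estimates of both inputs.
   * A missing estimate is treated as "unknown, assume very large".
   */
  public static JoinStrategy choose(OptionalLong leftBytes, OptionalLong rightBytes) {
    long smaller =
        Math.min(leftBytes.orElse(Long.MAX_VALUE), rightBytes.orElse(Long.MAX_VALUE));
    return smaller <= BROADCAST_LIMIT_BYTES ? JoinStrategy.BROADCAST : JoinStrategy.SHUFFLE;
  }

  public static void main(String[] args) {
    // One small side is enough to broadcast it.
    System.out.println(choose(OptionalLong.of(1024), OptionalLong.empty())); // BROADCAST
    // No statistics at all: fall back to the general shuffle join.
    System.out.println(choose(OptionalLong.empty(), OptionalLong.empty())); // SHUFFLE
  }
}
```

Either front end (SQL or Euphoria) could call the same helper, so the choice would behave identically regardless of which layer ends up hosting it.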

The other way, where relational algebra nodes become an essential part of (some) SDK, is equivalent to actually creating an SQL SDK, am I right? I understand that this approach can bring performance benefits, but besides that, is the language that implements SQL really important for users? Do we need SQL implementing Go UDFs, Java UDFs, Python UDFs? What would the resulting SQL query look like? If it is about allowing SQL to be used from all other SDKs (I want to do some basic preprocessing using SQL and then optimize some hard part in my favorite SDK), can this be solved instead by enabling SQL in all SDKs through mixing various SDK harnesses in a single pipeline (e.g. I want to use SQL in the Go SDK, so I just tell the portable layer to run these operators using the Java SDK and those using Go)? That seems plausible: it solves interoperability issues while leaving the whole implementation of SQL as an internal detail. Generally this solves more issues, like the ability to reuse IOs in all SDKs (I'm aware that there are caveats, but that is beyond the scope of the intended topic of this thread).


 Jan

On 12/3/18 7:27 PM, Kenneth Knowles wrote:

To be honest, I don't think there's much worth doing right now. I think more self-contained is better for Beam SQL, generally. Two things I have on my mind are (1) SQL as an inline transform in every SDK and (2) supporting pure SQL like the CLI and JDBC driver, where the underlying language is an implementation detail.

Big picture / long term, I would envision pure SQL, embedded SQL transform, and a DataFrame-like API in ~each SDK all desugaring to relational algebra nodes, sharing an optimizer, sharing some amount of mapping the physical plan to Beam transforms. The necessarily SDK-specific parts are the embedded transform API and UDFs in the host language. The rest should remain an implementation detail that we can change.

- For example, it is easy to imagine a customized columnar element/bundle encoding and SDK harness that only works for SQL to remove the overhead of being general purpose. It could be written in C/C++/Go if we wanted to squeeze it for perf. Such things are made harder by having an elaborate end-user API between SQL and the core Beam model.
- Conversely, for whatever is chosen to underlie SQL's execution, stability is paramount. Ideally the simplest and least likely to change transforms would be the foundation. And I wouldn't want to have to design a user-friendly API for Euphoria or the join library just to enable a different join algorithm in SQL.

So my take is keep SQL flexible, implement SQL on low-level and stable APIs, use join library, Euphoria, etc, if it looks like a big win, but don't build any policy here or do big refactors right now.


Kenn

On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je...@seznam.cz<mailto:je...@seznam.cz>> wrote:


    Hi Robert,

    currently there is no actual proposal, I was just trying to gather
    feedback from the community. But my original thoughts would be [1]. I
    actually don't see much need for restructuring the code by nesting
    directories. If the community sees that it would make sense to
    structure the dependencies, the second step would probably be to
    figure out how to accomplish this. I don't have any exact solution in
    mind so far; we would probably first need to identify the features
    that are needed by SQL and not currently supported by Euphoria. Then
    we can actually identify the costs and see if this still makes sense.

      Jan

    On 12/3/18 6:17 PM, Robert Bradshaw wrote:
    > Taking a step back, what exactly is the proposal? Looking at the
    > original message, I see
    >
    > (1) Letting SQL take a dependency on Euphoria, sharing more code and
    > taking advantage of the logical nesting of levels of abstraction.
    > This makes sense to me.
    > (2) Nesting the directories (but not the gradle targets or module
    > names?). Here I'm not so sure about the benefit, especially vs. the
    > cost.
    > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz <mailto:je...@seznam.cz>> wrote:
    >> I think that the fact that SQL uses some other internal dependency
    >> should remain a hidden implementation detail. I absolutely agree that
    >> the dependency should of course remain sdks-java-sql in all cases.
    >>
    >>     Jan
    >>
    >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
    >>> I suppose what I'm trying to say is that I see this module structure
    >>> as a tool for discoverability and enumerating end-user endpoints. In
    >>> other words, if one wants to use SQL, it would seem odd to have to
    >>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
    >>> sdks-java-euphoria is also a DSL one might use. A sibling
    >>> relationship does not prohibit the layered approach to
    >>> implementation that sounds like it makes sense.
    >>>
    >>> (As for merging Euphoria into core, my initial impression is that's
    >>> probably a good idea, and something we should consider for 3.0 at
    >>> the very least.)
    >>>
    >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz <mailto:je...@seznam.cz>> wrote:
    >>>> Hi Rui,
    >>>>
    >>>> yes, there are optimizations that could be added by each layer. The
    >>>> purpose of the Euphoria layer is actually not to reorder or modify
    >>>> any user operators that are present in the pipeline (because it
    >>>> might not have enough information to do this), but it can, for
    >>>> instance, choose between various join implementations (shuffle
    >>>> join, broadcast join, ...), so the optimizations it can do are more
    >>>> low level. But this plays nicely with the DSL hierarchy: each layer
    >>>> adds a few more restrictions, but can therefore do more
    >>>> optimizations. And I think that the layer between SDK and SQL
    >>>> wouldn't have to support SQL optimizations; it would only have to
    >>>> provide a way for SQL to express these optimizations.
    >>>>
    >>>>     Jan
    >>>>
    >>>> ---------- Original e-mail ----------
    >>>> From: Rui Wang <ruw...@google.com <mailto:ruw...@google.com>>
    >>>> To: dev@beam.apache.org <mailto:dev@beam.apache.org>
    >>>> Date: 30. 11. 2018 22:43:04
    >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
    >>>>
    >>>> SQL's optimization is another area to consider for integration. SQL
    >>>> optimization includes pushing down filters/projections, merging,
    >>>> removing or swapping plan nodes, and comparing plan costs to choose
    >>>> the best plan. Adding another layer between SQL and Java core might
    >>>> require that layer to support SQL optimizations, if there is a need.
    >>>>
    >>>> I don't have a clear picture of what SQL needs from Euphoria for
    >>>> optimization (best case is nothing). As those optimizations are
    >>>> happening or will happen, we might start to have a sense of it.
    >>>>
    >>>> -Rui
    >>>>
    >>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <rober...@google.com <mailto:rober...@google.com>> wrote:
    >>>>
    >>>> I don't really see Euphoria as a subset of SQL or the other way
    >>>> around, and I think it makes sense to use either without the other,
    >>>> so by this criterion keeping them as siblings makes more sense than
    >>>> nesting.
    >>>>
    >>>> That said, I think it's really good to have a bunch of shared code,
    >>>> e.g. a join library that could be used by both. One could even
    >>>> depend on the other without having to abandon the sibling
    >>>> relationship. Something like retractions belongs in the core SDK
    >>>> itself. Deeper than that, actually, it should be part of the model.
    >>>>
    >>>> - Robert
    >>>>
    >>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <d...@apache.org <mailto:d...@apache.org>> wrote:
    >>>>> Jan, we made Kryo optional recently (it is a separate module and is
    >>>>> used only in tests). From a quick look it seems that we forgot to
    >>>>> remove the compile time dependency from euphoria's build.gradle.
    >>>>> The only "strong" dependencies I'm aware of are core SDK and guava.
    >>>>> We'll probably be adding a sketching extension dependency soon.
    >>>>>
    >>>>> D.
    >>>>>
    >>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz <mailto:je...@seznam.cz>> wrote:
    >>>>>> Hi Anton,
    >>>>>> reactions inline.
    >>>>>>
    >>>>>> ---------- Original e-mail ----------
    >>>>>> From: Anton Kedin <ke...@google.com <mailto:ke...@google.com>>
    >>>>>> To: dev@beam.apache.org <mailto:dev@beam.apache.org>
    >>>>>> Date: 30. 11. 2018 18:17:06
    >>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
    >>>>>>
    >>>>>> I think this approach makes sense in general, Euphoria can be the
    >>>>>> implementation detail of SQL, similar to Join Library or core SDK
    >>>>>> Schemas.
    >>>>>>
    >>>>>> I wonder though whether it would be better to bring Euphoria
    >>>>>> closer to core SDK first, maybe even merge them together. If you
    >>>>>> look at Reuven's recent work around schemas it seems like there
    >>>>>> are already similarities between that and Euphoria's approach,
    >>>>>> unless I'm missing the point (e.g. Filter transforms, FullJoin vs
    >>>>>> CoGroup... see [2]). And we're already switching parts of SQL to
    >>>>>> those transforms (e.g. SQL Aggregation is now implemented by core
    >>>>>> SDK's Group [3]).
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>> Yes, these transforms seem to be very similar to those Euphoria
    >>>>>> has. Whether or not to merge Euphoria with core is essentially
    >>>>>> just a decision of the community (in my point of view).
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>> Adding explicit Schema support to Euphoria will both bring it
    >>>>>> closer to core SDK and make it natural to use for SQL. Can this be
    >>>>>> a first step towards this integration?
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>> Euphoria currently operates on pure PCollections, so when a
    >>>>>> PCollection has a schema, it will be accessible by Euphoria. It
    >>>>>> makes sense to make use of the schema in Euphoria. It seems
    >>>>>> natural on inputs to Euphoria operators, but it might be tricky
    >>>>>> (not saying impossible) to actually produce schema-aware
    >>>>>> PCollections as outputs from Euphoria operators (generally
    >>>>>> speaking; in special cases that might be possible). Regarding
    >>>>>> inputs, there is actually an intention to act on the type of the
    >>>>>> PCollection: e.g. when a PCollection is already of type KV, it is
    >>>>>> possible to make the key extractor and value extractor optional in
    >>>>>> Euphoria builders. So it feels natural to adapt the builders in
    >>>>>> the same way when they are given a schema-aware PCollection, and
    >>>>>> to make use of the provided schema. The rest of the Euphoria team
    >>>>>> might correct me if I'm wrong.
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>> One question I have is, does Euphoria bring dependencies that are
    >>>>>> not needed by SQL, or does it more or less rely only on the core
    >>>>>> SDK?
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>> I think the only relevant dependency that Euphoria has besides
    >>>>>> core SDK is Kryo. It is the default coder when no coder is
    >>>>>> provided, but that could be made optional, e.g. the default coder
    >>>>>> would be supported only if an appropriate module were available.
    >>>>>> That way I think that Euphoria has no special dependencies.
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
    >>>>>> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
    >>>>>> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
    >>>>>>
    >>>>>>
    >>>>>>
    >>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz <mailto:je...@seznam.cz>> wrote:
    >>>>>>
    >>>>>> Hi community,
    >>>>>>
    >>>>>> I'm part of the Euphoria DSL team, and on behalf of this team, I'd
    >>>>>> like to discuss possible development of the Java based DSLs
    >>>>>> currently present in Beam. To my knowledge, there are currently
    >>>>>> two DSLs based on the Java SDK: Euphoria and SQL. These DSLs
    >>>>>> currently share only the SDK itself, although there might be room
    >>>>>> to share some more effort. We already know that both Euphoria and
    >>>>>> SQL have a need for retractions, but there are probably many more
    >>>>>> features that these two could share.
    >>>>>>
    >>>>>> So, I'd like to open a discussion on what it would cost and what
    >>>>>> it would possibly bring, if instead of the current structure
    >>>>>>
    >>>>>>      Java SDK
    >>>>>>
    >>>>>>        | ---- SQL
    >>>>>>
    >>>>>>        | ---- Euphoria
    >>>>>>
    >>>>>> these DSLs would be structured as
    >>>>>>
    >>>>>>      Java SDK ---> Euphoria ---> SQL
    >>>>>>
    >>>>>> I'm absolutely sure that this would be a large investment and a
    >>>>>> huge change, but I'd like to gather some opinions and general
    >>>>>> feelings of the community about this. Some points to start the
    >>>>>> discussion from my side would be that structuring DSLs like this
    >>>>>> has internal logical consistency, because each API layer further
    >>>>>> narrows completeness, but brings a simpler API for simpler tasks,
    >>>>>> while adding an additional high-level view of the data processing
    >>>>>> pipeline and thus enabling more optimizations. On the Euphoria
    >>>>>> side, these are various implementations of joins (the most
    >>>>>> effective implementation depends on the data), pipeline sampling
    >>>>>> and more. Some (or maybe most) of these optimizations would have
    >>>>>> to be implemented in both DSLs, so implementing them once is
    >>>>>> beneficial. Another benefit is that this would bring Euphoria
    >>>>>> "closer" to Beam core development (which would be good, it is part
    >>>>>> of the project anyway, right? :)) and help better drive features
    >>>>>> that, although currently needed mostly by SQL, might be needed by
    >>>>>> other Java users anyway.
    >>>>>>
    >>>>>> Thanks for discussion and looking forward to any opinions.
    >>>>>>
    >>>>>>      Jan
    >>>>>>
