Re: [YAML] Aggregations

Robert Bradshaw via dev Thu, 19 Oct 2023 12:06:23 -0700

On Thu, Oct 19, 2023 at 11:42 AM Jan Lukavský <[email protected]> wrote:
>
> On 10/19/23 19:41, Robert Bradshaw via dev wrote:
> > On Thu, Oct 19, 2023 at 10:25 AM Jan Lukavský <[email protected]> wrote:
> >> On 10/19/23 18:28, Robert Bradshaw via dev wrote:
> >>> On Thu, Oct 19, 2023 at 9:00 AM Byron Ellis <[email protected]> wrote:
> >>>> Rill is definitely SQL-oriented but I think that's going to be the most 
> >>>> common. Dataframes are explicitly modeled on the relational approach so 
> >>>> that's going to look a lot like SQL,
> >>> I think pretty much any approach that fits here is going to be
> >>> relational, meaning you choose a set of columns to group on, a set of
> >>> columns to aggregate, and how to aggregate. The big open question is
> >>> what syntax to use for the "how."
> >> This might already be answered (if so, pardon my ignorance), but what
> >> problem is this declarative approach trying to solve? Is it meant to be
> >> more expressive or equally expressive than SQL? And if more, how much more?
> > I'm not sure if you're asking about YAML in general, or the particular
> > case of aggregation, but I can answer both.
> >
> > For the larger Beam YAML project, it's trying to solve the problem
> > that SQL is (and I'll admit this is somewhat subjective here) good at
> > expressing the T part of ETL, but not the other parts. For example,
> > the simple data movement use case of (say) reading from PubSub and
> > dumping into BigQuery is not well expressed in terms of SQL. SQL is
> > also fairly awkward when it comes to defining UDFs and TDFs and
> > non-linear pipelines (especially those with fanout). There are of
> > course other tools in this space (dbt comes to mind, and there's been
> > some investigation on how to make dbt play well with Beam). The other
> > niche it is trying to solve is that installing and learning a full SDK
> > is heavyweight and overkill for creating pipelines that are simply
> > wiring together pre-defined transforms.
>
> I think FlinkSQL solves the problem of E and L in SQL via CREATE TABLE
> and INSERT statements. I agree with the fanout part, though it might be
> possible to use CREATE (TEMPORARY) TABLE AS SELECT ... to solve that
> as well.


Yeah, Beam uses the CREATE TABLE for referring to external data too.
(I don't remember about INSERT statements). This is where (IMHO of
course) SQL starts to get very messy (and non-standard).

> > As for the more narrow case of aggregations, I think being similarly
> > expressive as SQL is fine, though it'd be good to make custom UDAFs
> > more natural. Originally I was thinking that just having SqlTransform
> > might be sufficient, but it feels like a big hammer to reach for every
> > time I just want to sum over one or two columns.
>
> Yes, defining UDFs and UDAFs is painful; that was the motivation for my
> question. It also defines what the syntax for such a UDAF would need to
> look like. It would require breaking UDAFs down into several primitive
> UDFs and then using a functional style to declare them. Most of the time
> it would probably be sufficient to use simplified CombineFn semantics
> with the accumulator limited to a primitive type (long, double,
> string, maybe array?). I suppose declaring a full-blown stateful DoFn
> (timers, generic state, ...) is out of scope.

Other than the possible totally-commutative-associative V* -> V case,
I'm probably fine with referencing existing CombineFns (including ones
defined in files like we do with mapping fns) rather than providing a
full YAML syntax for defining them. But we do need to be able to
parameterize them.
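
To make the "reference an existing CombineFn and parameterize it" idea
concrete, here is a rough sketch in plain Python. Everything in it is an
assumption for illustration: CountAbove, the COMBINERS registry, and the
aggregate() helper are hypothetical names, not Beam or Beam YAML APIs. The
class only mirrors the four-method shape of a CombineFn
(create_accumulator, add_input, merge_accumulators, extract_output) with a
single int as the accumulator, so it runs without Beam installed.

```python
from collections import defaultdict


class CountAbove:
    """Hypothetical parameterized combiner: counts values above a threshold.

    Mirrors the four-method shape of a CombineFn, but is plain Python.
    The accumulator is a single int (a primitive type).
    """

    def __init__(self, threshold):
        self.threshold = threshold

    def create_accumulator(self):
        return 0

    def add_input(self, accumulator, value):
        return accumulator + (1 if value > self.threshold else 0)

    def merge_accumulators(self, accumulators):
        return sum(accumulators)

    def extract_output(self, accumulator):
        return accumulator


# Hypothetical registry: a declarative spec would name a combiner and pass
# keyword arguments to it, rather than defining the combiner inline.
COMBINERS = {"count_above": CountAbove}


def aggregate(rows, group_by, column, fn_name, **fn_args):
    """Group rows (dicts) by one column and combine another column."""
    fn = COMBINERS[fn_name](**fn_args)
    partials_by_key = defaultdict(list)
    for i, row in enumerate(rows):
        # Keep two partial accumulators per key (even/odd rows) so that
        # merge_accumulators is actually exercised.
        partials = partials_by_key[row[group_by]]
        if not partials:
            partials.extend([fn.create_accumulator(), fn.create_accumulator()])
        partials[i % 2] = fn.add_input(partials[i % 2], row[column])
    return {
        key: fn.extract_output(fn.merge_accumulators(partials))
        for key, partials in partials_by_key.items()
    }


rows = [
    {"user": "a", "amount": 5},
    {"user": "a", "amount": 12},
    {"user": "b", "amount": 20},
    {"user": "a", "amount": 30},
]
result = aggregate(rows, group_by="user", column="amount",
                   fn_name="count_above", threshold=10)
```

The point of the sketch is only that the "how" of an aggregation can be a
name plus keyword arguments (here fn_name and threshold), with the combiner
itself defined elsewhere.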
