Re: [DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Fabian Hueske Fri, 20 Jan 2017 15:31:33 -0800

Hi Shaoxuan,

thanks a lot for this great design doc.
I think user defined aggregation functions are a very important feature for
the Table API and SQL.


Have you thought about how the aggregation functions will be embedded in
Flink functions?
At the moment, we have a generic Flink function which is configured with
aggregation functions, i.e., we do not leverage code generation here.
Do you plan to embed built-in and user-defined aggregations functions that
implement the proposed API with code generation?

Can you maybe extend the JIRA or design document with this information?

Thank you,
Fabian

2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <[email protected]>:

> Hi everyone,
> I have drafted the design doc (link is provided below) for UDAGG, and
> created the JIRA (FLINK-5564) to track the progress of this design.
> Special thanks to Stephan and Fabian for their advice and help.
>
> Please check the design doc, feel free to share your comments in the google
> doc:
> https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz6
> 7yXOypY7Uh5gIOK2r-U/edit
>
> Regards,
> Shaoxuan
>
> On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <[email protected]> wrote:
>
> > Hi Shaoxuan,
> >
> > user-defined aggregates would be a great addition to the Table API / SQL.
> > I completely agree that the current (internal) interface is not well
> suited
> > as an external interface and needs to be redesigned if exposed to users.
> >
> > We need to careful think about this new interface and how we can
> integrate
> > it with the DataStream (and DataSet) API to support all required
> > operations, esp. with respect to null aggregates and support for
> combining
> > / merging.
> > I agree that for efficient execution, we should avoid WindowFunctions
> > (large state) and FoldFunction (not mergeable). If we need a new
> interface
> > in the DataStream API, we need to discuss this in more detail.
> > I think we need a bit more information about the proposed UDAGG interface
> > to discuss how this can be mapped to DataStream operators.
> >
> > Support for retraction will be required for our future plans with the
> > streaming Table API / SQL interface.
> >
> > Looking forward to your proposal,
> > Fabian
> >
> > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <[email protected]>:
> >
> > > Hello everyone,
> > >
> > > I am writing this email to propose a new User Defined Aggregate
> > interface.
> > > We were trying to leverage the existing Aggregate interface, but
> > > unfortunately we realized that it is not sufficient to meet all our
> > needs.
> > > Here are the obstacles we have observed:
> > > 1) The current aggregate interface is not very concise to users. One
> > needs
> > > to know the design details of the intermediate Row buffer before
> > implements
> > > an Aggregate. Seven functions are needed even for a simple Count
> > aggregate.
> > > We'd better to make the UDAGG interface much more concisely.
> > > 2) the current aggregate function can be only applied on one single
> > column.
> > > There are many scenarios which require the aggregate function taking
> > > multiple columns as the inputs.
> > > 3) “Retraction” is not covered in the current Aggregate.
> > >
> > > For #1, I am thinking instead of letting users to manipulate the
> > > intermediate buffer, we could potentially put the entire Aggregate
> > instance
> > > or a subclass instance of Aggregate to the Row buffer, such that the
> user
> > > does not need to know how the Aggregate state is maintained by the
> > > framework.
> > > But to achieve this goal, we probably need a new dataStream API. The
> > > existing reduce API does not work with two different types of inputs
> (in
> > > this proposal, the inputs will be upstream values, and the instance of
> > the
> > > current accumulated Aggregate), while the fold API is not able to merge
> > the
> > > two Aggregate results (which is usually needed for merging two session
> > > windows).
> > >
> > > For #3, besides the aggregate itself, there are a few other things need
> > to
> > > be taken care of to fully support the retractions. I will share a
> > separate
> > > concrete proposal about how to generate and process retractions, and
> how
> > it
> > > works along with this new proposed UDAGG.
> > >
> > > I would like really appreciate if you can share your opinions on this
> > > proposal, especially for the needed dataStream API for #1. Also, if
> there
> > > is any other good things you think to be better added for UDAGG, please
> > > feel free to share with us. I will draft my proposal in a google doc
> and
> > > share to the flink DEV group very soon.
> > >
> > > Thanks,
> > > Shaoxuan
> > >
> >
>

Re: [DISCUSS] proposal for the User Defined AGGregate (UDAGG)

Reply via email to