Hi Shaoxuan, thanks a lot for this great design doc. I think user defined aggregation functions are a very important feature for the Table API and SQL.
Have you thought about how the aggregation functions will be embedded in Flink functions? At the moment, we have a generic Flink function which is configured with aggregation functions, i.e., we do not leverage code generation here. Do you plan to embed built-in and user-defined aggregations functions that implement the proposed API with code generation? Can you maybe extend the JIRA or design document with this information? Thank you, Fabian 2017-01-18 20:55 GMT+01:00 Shaoxuan Wang <wshaox...@gmail.com>: > Hi everyone, > I have drafted the design doc (link is provided below) for UDAGG, and > created the JIRA (FLINK-5564) to track the progress of this design. > Special thanks to Stephan and Fabian for their advice and help. > > Please check the design doc, feel free to share your comments in the google > doc: > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz6 > 7yXOypY7Uh5gIOK2r-U/edit > > Regards, > Shaoxuan > > On Wed, Jan 11, 2017 at 6:09 AM, Fabian Hueske <fhue...@gmail.com> wrote: > > > Hi Shaoxuan, > > > > user-defined aggregates would be a great addition to the Table API / SQL. > > I completely agree that the current (internal) interface is not well > suited > > as an external interface and needs to be redesigned if exposed to users. > > > > We need to careful think about this new interface and how we can > integrate > > it with the DataStream (and DataSet) API to support all required > > operations, esp. with respect to null aggregates and support for > combining > > / merging. > > I agree that for efficient execution, we should avoid WindowFunctions > > (large state) and FoldFunction (not mergeable). If we need a new > interface > > in the DataStream API, we need to discuss this in more detail. > > I think we need a bit more information about the proposed UDAGG interface > > to discuss how this can be mapped to DataStream operators. > > > > Support for retraction will be required for our future plans with the > > streaming Table API / SQL interface. > > > > Looking forward to your proposal, > > Fabian > > > > 2017-01-10 15:40 GMT+01:00 Shaoxuan Wang <wshaox...@gmail.com>: > > > > > Hello everyone, > > > > > > I am writing this email to propose a new User Defined Aggregate > > interface. > > > We were trying to leverage the existing Aggregate interface, but > > > unfortunately we realized that it is not sufficient to meet all our > > needs. > > > Here are the obstacles we have observed: > > > 1) The current aggregate interface is not very concise to users. One > > needs > > > to know the design details of the intermediate Row buffer before > > implements > > > an Aggregate. Seven functions are needed even for a simple Count > > aggregate. > > > We'd better to make the UDAGG interface much more concisely. > > > 2) the current aggregate function can be only applied on one single > > column. > > > There are many scenarios which require the aggregate function taking > > > multiple columns as the inputs. > > > 3) “Retraction” is not covered in the current Aggregate. > > > > > > For #1, I am thinking instead of letting users to manipulate the > > > intermediate buffer, we could potentially put the entire Aggregate > > instance > > > or a subclass instance of Aggregate to the Row buffer, such that the > user > > > does not need to know how the Aggregate state is maintained by the > > > framework. > > > But to achieve this goal, we probably need a new dataStream API. The > > > existing reduce API does not work with two different types of inputs > (in > > > this proposal, the inputs will be upstream values, and the instance of > > the > > > current accumulated Aggregate), while the fold API is not able to merge > > the > > > two Aggregate results (which is usually needed for merging two session > > > windows). > > > > > > For #3, besides the aggregate itself, there are a few other things need > > to > > > be taken care of to fully support the retractions. I will share a > > separate > > > concrete proposal about how to generate and process retractions, and > how > > it > > > works along with this new proposed UDAGG. > > > > > > I would like really appreciate if you can share your opinions on this > > > proposal, especially for the needed dataStream API for #1. Also, if > there > > > is any other good things you think to be better added for UDAGG, please > > > feel free to share with us. I will draft my proposal in a google doc > and > > > share to the flink DEV group very soon. > > > > > > Thanks, > > > Shaoxuan > > > > > >