[jira] [Closed] (FLINK-12744) ML common parameters
[ https://issues.apache.org/jira/browse/FLINK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-12744. - Resolution: Fixed Fix Version/s: 1.9.0 > ML common parameters > > > Key: FLINK-12744 > URL: https://issues.apache.org/jira/browse/FLINK-12744 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Xu Yang >Assignee: Xu Yang >Priority: Major > Labels: pull-request-available > Fix For: 1.9.0 > > Time Spent: 10m > Remaining Estimate: 0h > > We defined some commonly used parameters for machine-learning algorithms. > - *add ML common parameters* > - *change behavior when using the default constructor of the param factory* > - *add shared params in ml package* > - *add flink-ml module* -- This message was sent by Atlassian JIRA (v7.6.3#76005)
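For readers following the thread, the idea behind "shared params" and a param factory can be sketched in plain Java. Everything below (class names, the labelCol/maxIter parameters, the default-value behavior) is a hypothetical, self-contained stand-in for illustration, not the actual flink-ml-api signatures:

```java
import java.util.HashMap;
import java.util.Map;

public class SharedParamsSketch {
    // Minimal stand-in for a typed parameter descriptor with an optional default.
    static final class ParamInfo<V> {
        final String name;
        final V defaultValue;
        final boolean hasDefault;

        ParamInfo(String name, V defaultValue, boolean hasDefault) {
            this.name = name;
            this.defaultValue = defaultValue;
            this.hasDefault = hasDefault;
        }
    }

    // "Shared params" are simply ParamInfo constants reused by many algorithms.
    static final ParamInfo<String> LABEL_COL = new ParamInfo<>("labelCol", "label", true);
    static final ParamInfo<Integer> MAX_ITER = new ParamInfo<>("maxIter", 100, true);

    // Minimal Params container keyed by parameter name.
    static final class Params {
        private final Map<String, Object> values = new HashMap<>();

        <V> Params set(ParamInfo<V> info, V value) {
            values.put(info.name, value);
            return this;
        }

        @SuppressWarnings("unchecked")
        <V> V get(ParamInfo<V> info) {
            if (values.containsKey(info.name)) {
                return (V) values.get(info.name);
            }
            if (info.hasDefault) {
                return info.defaultValue; // fall back to the shared default
            }
            throw new IllegalArgumentException("No value set for " + info.name);
        }
    }

    public static void main(String[] args) {
        Params params = new Params().set(MAX_ITER, 50);
        System.out.println(params.get(MAX_ITER));  // explicitly set value
        System.out.println(params.get(LABEL_COL)); // falls back to the default
    }
}
```

The point of the pattern is that an AI platform can get/set any algorithm's parameters through one uniform container without knowing the algorithm's internals.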
[jira] [Commented] (FLINK-12744) ML common parameters
[ https://issues.apache.org/jira/browse/FLINK-12744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879343#comment-16879343 ] Shaoxuan Wang commented on FLINK-12744: --- Fixed in master: f18481ee54b61a737ae1426ff4dcaf7006e0edbd > ML common parameters > > > Key: FLINK-12744 > URL: https://issues.apache.org/jira/browse/FLINK-12744 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Xu Yang >Assignee: Xu Yang >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > We defined some commonly used parameters for machine-learning algorithms. > - *add ML common parameters* > - *change behavior when using the default constructor of the param factory* > - *add shared params in ml package* > - *add flink-ml module* -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-12758) Add flink-ml-lib module
[ https://issues.apache.org/jira/browse/FLINK-12758?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-12758. - Resolution: Fixed > Add flink-ml-lib module > --- > > Key: FLINK-12758 > URL: https://issues.apache.org/jira/browse/FLINK-12758 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Luo Gen >Assignee: Luo Gen >Priority: Major > Labels: pull-request-available > Fix For: 1.9.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This Jira introduces a new module "flink-ml-lib" under flink-ml-parent. > The flink-ml-lib module is planned in the roadmap of FLIP-39, as the code base of > library implementations of FlinkML. This Jira only aims to create the module; > algorithms will be added in separate Jiras in the future. > For more details, please refer to the [FLIP39 design > doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-12758) Add flink-ml-lib module
[ https://issues.apache.org/jira/browse/FLINK-12758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879055#comment-16879055 ] Shaoxuan Wang commented on FLINK-12758: --- Fixed in master: 726d9e49905e893659a1f7b0ba83a0a59bec8fac > Add flink-ml-lib module > --- > > Key: FLINK-12758 > URL: https://issues.apache.org/jira/browse/FLINK-12758 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Luo Gen >Assignee: Luo Gen >Priority: Major > Labels: pull-request-available > Fix For: 1.9.0 > > Time Spent: 20m > Remaining Estimate: 0h > > This Jira introduces a new module "flink-ml-lib" under flink-ml-parent. > The flink-ml-lib module is planned in the roadmap of FLIP-39, as the code base of > library implementations of FlinkML. This Jira only aims to create the module; > algorithms will be added in separate Jiras in the future. > For more details, please refer to the [FLIP39 design > doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-12597) Remove the legacy flink-libraries/flink-ml
[ https://issues.apache.org/jira/browse/FLINK-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-12597. - Resolution: Fixed Fix Version/s: 1.9.0 > Remove the legacy flink-libraries/flink-ml > -- > > Key: FLINK-12597 > URL: https://issues.apache.org/jira/browse/FLINK-12597 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Affects Versions: 1.9.0 >Reporter: Shaoxuan Wang >Assignee: Luo Gen >Priority: Major > Labels: pull-request-available > Fix For: 1.9.0 > > Time Spent: 20m > Remaining Estimate: 0h > > As discussed in dev-ml, we decided to delete the legacy > flink-libraries/flink-ml, as well as the flink-libraries/flink-ml-uber. There > is no further development planned for this legacy flink-ml package in > 1.9 or beyond. Users can just use the 1.8 version if their > products/projects still rely on this package. > [1] > [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-ml-and-DISCUSS-Delete-flink-ml-td29057.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-12597) Remove the legacy flink-libraries/flink-ml
[ https://issues.apache.org/jira/browse/FLINK-12597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16879052#comment-16879052 ] Shaoxuan Wang commented on FLINK-12597: --- Fixed in master: 2b2a83df56242aa90ee731f25d17b050b75df0f3 > Remove the legacy flink-libraries/flink-ml > -- > > Key: FLINK-12597 > URL: https://issues.apache.org/jira/browse/FLINK-12597 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Affects Versions: 1.9.0 >Reporter: Shaoxuan Wang >Assignee: Luo Gen >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > As discussed in dev-ml, we decided to delete the legacy > flink-libraries/flink-ml, as well as the flink-libraries/flink-ml-uber. There > is no further development planned for this legacy flink-ml package in > 1.9 or beyond. Users can just use the 1.8 version if their > products/projects still rely on this package. > [1] > [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-ml-and-DISCUSS-Delete-flink-ml-td29057.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-12881) Add more functionalities for ML Params and ParamInfo class
[ https://issues.apache.org/jira/browse/FLINK-12881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-12881. - Resolution: Fixed Fix Version/s: 1.9.0 > Add more functionalities for ML Params and ParamInfo class > -- > > Key: FLINK-12881 > URL: https://issues.apache.org/jira/browse/FLINK-12881 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Xu Yang >Assignee: Xu Yang >Priority: Major > Labels: pull-request-available > Fix For: 1.9.0 > > Time Spent: 20m > Remaining Estimate: 0h > > 1. change Params$get to support aliases > * support aliases > * if the Params contains the specific parameter and an alias but has more than > one value, or if the Params doesn't contain the specific parameter while the ParamInfo > is optional but has no default value, it will throw an exception > * when isOptional is true and contains(ParamInfo) is true, it will return the > value found in Params, whether the value is null or not. when isOptional is > true and contains(ParamInfo) is false: if hasDefaultValue is true, it will return the > defaultValue; if hasDefaultValue is false, it will throw an exception. developers > should use contains to check whether Params has the ParamInfo. > 2. add size, clear, isEmpty, contains, fromJson in Params > 3. fix ParamInfo, PipelineStage to adapt to the new Params > * assert null in alias > * use Params$loadJson in PipelineStage > 4. add test cases about aliases -- This message was sent by Atlassian JIRA (v7.6.3#76005)
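The alias-resolution rules described in item 1 can be sketched as self-contained Java. The class shapes below are hypothetical stand-ins (not the real Params/ParamInfo signatures), and for simplicity this sketch throws whenever the canonical name and an alias are both present, rather than comparing their values:

```java
import java.util.HashMap;
import java.util.Map;

public class AliasParamsSketch {
    static final class ParamInfo<V> {
        final String name;
        final String[] aliases;
        final boolean isOptional;
        final boolean hasDefaultValue;
        final V defaultValue;

        ParamInfo(String name, String[] aliases, boolean isOptional,
                  boolean hasDefaultValue, V defaultValue) {
            this.name = name;
            this.aliases = aliases;
            this.isOptional = isOptional;
            this.hasDefaultValue = hasDefaultValue;
            this.defaultValue = defaultValue;
        }
    }

    static final class Params {
        private final Map<String, Object> values = new HashMap<>();

        Params set(String key, Object value) {
            values.put(key, value);
            return this;
        }

        @SuppressWarnings("unchecked")
        <V> V get(ParamInfo<V> info) {
            // Scan the canonical name plus every alias.
            String foundKey = null;
            String[] keys = new String[info.aliases.length + 1];
            keys[0] = info.name;
            System.arraycopy(info.aliases, 0, keys, 1, info.aliases.length);
            for (String k : keys) {
                if (values.containsKey(k)) {
                    if (foundKey != null) {
                        // Name and an alias are both set: ambiguous, so fail.
                        throw new IllegalArgumentException("Duplicate values for " + info.name);
                    }
                    foundKey = k;
                }
            }
            if (foundKey != null) {
                return (V) values.get(foundKey); // may legitimately be null
            }
            if (info.isOptional && info.hasDefaultValue) {
                return info.defaultValue;
            }
            // Missing, and either required or optional without a default.
            throw new IllegalArgumentException("Missing value for " + info.name);
        }
    }
}
```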
[jira] [Commented] (FLINK-12881) Add more functionalities for ML Params and ParamInfo class
[ https://issues.apache.org/jira/browse/FLINK-12881?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875719#comment-16875719 ] Shaoxuan Wang commented on FLINK-12881: --- Fixed in master: 1660c6b47af789fa5c9bf6a3ff77e868ca90 > Add more functionalities for ML Params and ParamInfo class > -- > > Key: FLINK-12881 > URL: https://issues.apache.org/jira/browse/FLINK-12881 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Xu Yang >Assignee: Xu Yang >Priority: Major > Labels: pull-request-available > Time Spent: 20m > Remaining Estimate: 0h > > 1. change Params$get to support aliases > * support aliases > * if the Params contains the specific parameter and an alias but has more than > one value, or if the Params doesn't contain the specific parameter while the ParamInfo > is optional but has no default value, it will throw an exception > * when isOptional is true and contains(ParamInfo) is true, it will return the > value found in Params, whether the value is null or not. when isOptional is > true and contains(ParamInfo) is false: if hasDefaultValue is true, it will return the > defaultValue; if hasDefaultValue is false, it will throw an exception. developers > should use contains to check whether Params has the ParamInfo. > 2. add size, clear, isEmpty, contains, fromJson in Params > 3. fix ParamInfo, PipelineStage to adapt to the new Params > * assert null in alias > * use Params$loadJson in PipelineStage > 4. add test cases about aliases -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Issue Comment Deleted] (FLINK-12470) FLIP39: Flink ML pipeline and ML libs
[ https://issues.apache.org/jira/browse/FLINK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-12470: -- Comment: was deleted (was: Fixed in master:1660c6b47af789fa5c9bf6a3ff77e868ca90) > FLIP39: Flink ML pipeline and ML libs > - > > Key: FLINK-12470 > URL: https://issues.apache.org/jira/browse/FLINK-12470 > Project: Flink > Issue Type: New Feature > Components: Library / Machine Learning >Affects Versions: 1.9.0 >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang >Priority: Major > Fix For: 1.9.0 > > Original Estimate: 720h > Remaining Estimate: 720h > > This is the umbrella Jira for FLIP39, which intends to enhance the > scalability and the ease of use of Flink ML. > ML Discussion thread: > [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html] > Google Doc: (will convert it to an official confluence page very soon) > [https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo/edit] > In machine learning, there are mainly two types of people. The first type is > MLlib developers. They need a set of standard, well-abstracted core ML APIs to > implement the algorithms. Every ML algorithm is a certain concrete > implementation on top of these APIs. The second type is MLlib users who > utilize the existing/packaged MLlib to train or serve a model. It is pretty > common that the entire training or inference is constructed by a sequence of > transformations or algorithms. It is essential to provide a workflow/pipeline > API for MLlib users such that they can easily combine multiple algorithms to > describe the ML workflow/pipeline. > Currently Flink has a set of ML core interfaces, but they are built on top of > the dataset API. 
This does not quite align with the latest flink > [roadmap|https://flink.apache.org/roadmap.html] (TableAPI will become the > first-class citizen and primary API for analytics use cases, while the dataset > API will be gradually deprecated). Moreover, Flink at present does not have > any interface that allows MLlib users to describe an ML workflow/pipeline, > nor does it provide any approach to persist a pipeline or model and reuse them in the > future. To solve/improve these issues, in this FLIP we propose to: > * Provide a new set of ML core interfaces (on top of the Flink TableAPI) > * Provide an ML pipeline interface (on top of the Flink TableAPI) > * Provide the interfaces for parameter management and pipeline persistence > * All the above interfaces should facilitate any new ML algorithm. We will > gradually add various standard ML algorithms on top of these newly proposed > interfaces to ensure their feasibility and scalability. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-12470) FLIP39: Flink ML pipeline and ML libs
[ https://issues.apache.org/jira/browse/FLINK-12470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16875717#comment-16875717 ] Shaoxuan Wang commented on FLINK-12470: --- Fixed in master: 1660c6b47af789fa5c9bf6a3ff77e868ca90 > FLIP39: Flink ML pipeline and ML libs > - > > Key: FLINK-12470 > URL: https://issues.apache.org/jira/browse/FLINK-12470 > Project: Flink > Issue Type: New Feature > Components: Library / Machine Learning >Affects Versions: 1.9.0 >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang >Priority: Major > Fix For: 1.9.0 > > Original Estimate: 720h > Remaining Estimate: 720h > > This is the umbrella Jira for FLIP39, which intends to enhance the > scalability and the ease of use of Flink ML. > ML Discussion thread: > [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html] > Google Doc: (will convert it to an official confluence page very soon) > [https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo/edit] > In machine learning, there are mainly two types of people. The first type is > MLlib developers. They need a set of standard, well-abstracted core ML APIs to > implement the algorithms. Every ML algorithm is a certain concrete > implementation on top of these APIs. The second type is MLlib users who > utilize the existing/packaged MLlib to train or serve a model. It is pretty > common that the entire training or inference is constructed by a sequence of > transformations or algorithms. It is essential to provide a workflow/pipeline > API for MLlib users such that they can easily combine multiple algorithms to > describe the ML workflow/pipeline. > Currently Flink has a set of ML core interfaces, but they are built on top of > the dataset API. 
This does not quite align with the latest flink > [roadmap|https://flink.apache.org/roadmap.html] (TableAPI will become the > first-class citizen and primary API for analytics use cases, while the dataset > API will be gradually deprecated). Moreover, Flink at present does not have > any interface that allows MLlib users to describe an ML workflow/pipeline, > nor does it provide any approach to persist a pipeline or model and reuse them in the > future. To solve/improve these issues, in this FLIP we propose to: > * Provide a new set of ML core interfaces (on top of the Flink TableAPI) > * Provide an ML pipeline interface (on top of the Flink TableAPI) > * Provide the interfaces for parameter management and pipeline persistence > * All the above interfaces should facilitate any new ML algorithm. We will > gradually add various standard ML algorithms on top of these newly proposed > interfaces to ensure their feasibility and scalability. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-12473) Add the interface of ML pipeline and ML lib
[ https://issues.apache.org/jira/browse/FLINK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-12473. - Resolution: Fixed Fix Version/s: 1.9.0 Fixed in master: 305095743ffe0bc39f76c1bda983da7d0df9e003 > Add the interface of ML pipeline and ML lib > --- > > Key: FLINK-12473 > URL: https://issues.apache.org/jira/browse/FLINK-12473 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Shaoxuan Wang >Assignee: Luo Gen >Priority: Major > Labels: pull-request-available > Fix For: 1.9.0 > > Attachments: image-2019-05-10-12-50-15-869.png > > Original Estimate: 168h > Time Spent: 0.5h > Remaining Estimate: 167.5h > > This Jira will introduce the major interfaces for ML pipeline and ML lib. > The major interfaces and their relationship diagram are shown below. For > more details, please refer to the [FLIP39 design > doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo] > > !image-2019-05-10-12-50-15-869.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12597) Remove the legacy flink-libraries/flink-ml
Shaoxuan Wang created FLINK-12597: - Summary: Remove the legacy flink-libraries/flink-ml Key: FLINK-12597 URL: https://issues.apache.org/jira/browse/FLINK-12597 Project: Flink Issue Type: Sub-task Components: Library / Machine Learning Affects Versions: 1.9.0 Reporter: Shaoxuan Wang Assignee: Luo Gen As discussed in dev-ml, we decided to delete the legacy flink-libraries/flink-ml, as well as the flink-libraries/flink-ml-uber. There is no further development planned for this legacy flink-ml package in 1.9 or beyond. Users can just use the 1.8 version if their products/projects still rely on this package. [1] [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/SURVEY-Usage-of-flink-ml-and-DISCUSS-Delete-flink-ml-td29057.html] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (FLINK-12473) Add the interface of ML pipeline and ML lib
[ https://issues.apache.org/jira/browse/FLINK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-12473: - Assignee: Luo Gen > Add the interface of ML pipeline and ML lib > --- > > Key: FLINK-12473 > URL: https://issues.apache.org/jira/browse/FLINK-12473 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Shaoxuan Wang >Assignee: Luo Gen >Priority: Major > Labels: pull-request-available > Attachments: image-2019-05-10-12-50-15-869.png > > Original Estimate: 168h > Time Spent: 10m > Remaining Estimate: 167h 50m > > This Jira will introduce the major interfaces for ML pipeline and ML lib. > The major interfaces and their relationship diagram are shown below. For > more details, please refer to the [FLIP39 design > doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo] > > !image-2019-05-10-12-50-15-869.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-12473) Add the interface of ML pipeline and ML lib
[ https://issues.apache.org/jira/browse/FLINK-12473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-12473: -- Summary: Add the interface of ML pipeline and ML lib (was: Add ML pipeline and ML lib Core API) > Add the interface of ML pipeline and ML lib > --- > > Key: FLINK-12473 > URL: https://issues.apache.org/jira/browse/FLINK-12473 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Shaoxuan Wang >Priority: Major > Attachments: image-2019-05-10-12-50-15-869.png > > Original Estimate: 168h > Remaining Estimate: 168h > > This Jira will introduce the major interfaces for ML pipeline and ML lib. > The major interfaces and their relationship diagram are shown below. For > more details, please refer to the [FLIP39 design > doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo] > > !image-2019-05-10-12-50-15-869.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12473) Add ML pipeline and ML lib Core API
Shaoxuan Wang created FLINK-12473: - Summary: Add ML pipeline and ML lib Core API Key: FLINK-12473 URL: https://issues.apache.org/jira/browse/FLINK-12473 Project: Flink Issue Type: Sub-task Components: Library / Machine Learning Reporter: Shaoxuan Wang Attachments: image-2019-05-10-12-50-15-869.png This Jira will introduce the major interfaces for ML pipeline and ML lib. The major interfaces and their relationship diagram are shown below. For more details, please refer to the [FLIP39 design doc|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo] !image-2019-05-10-12-50-15-869.png! -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11115) Port some flink.ml algorithms to table based
[ https://issues.apache.org/jira/browse/FLINK-11115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-11115. - Resolution: Duplicate We decided to move the entire ml pipeline and ml lib development under the umbrella Jira (FLINK-12470) of FLIP39 > Port some flink.ml algorithms to table based > > > Key: FLINK-11115 > URL: https://issues.apache.org/jira/browse/FLINK-11115 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Weihua Jiang >Priority: Major > > This sub-task is to port some flink.ml algorithms to table based to verify > the correctness of the design and implementation. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11114) Support wrapping inference pipeline as a UDF function in SQL
[ https://issues.apache.org/jira/browse/FLINK-11114?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-11114. - Resolution: Duplicate We decided to move the entire ml pipeline and ml lib development under the umbrella Jira (FLINK-12470) of FLIP39 > Support wrapping inference pipeline as a UDF function in SQL > > > Key: FLINK-11114 > URL: https://issues.apache.org/jira/browse/FLINK-11114 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Weihua Jiang >Priority: Major > > Though the standard ML pipeline inference usage is table based (that is, users > at the client construct the DAG only using the table API), it is also desirable to > wrap the inference logic as a UDF to be used in SQL. This means it shall be > record based. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11113) Support periodically update models when inferencing
[ https://issues.apache.org/jira/browse/FLINK-11113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-11113. - Resolution: Duplicate We decided to move the entire ml pipeline and ml lib development under the umbrella Jira (FLINK-12470) of FLIP39 > Support periodically update models when inferencing > --- > > Key: FLINK-11113 > URL: https://issues.apache.org/jira/browse/FLINK-11113 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Weihua Jiang >Priority: Major > > As models will be periodically updated and the inference job may be running on > a stream and will NOT finish, it is important to have this inference job > periodically reload the latest model for inference without starting/stopping the > inference job. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11111) Create a new set of parameters
[ https://issues.apache.org/jira/browse/FLINK-11111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-11111. - Resolution: Duplicate We decided to move the entire ml pipeline and ml lib development under the umbrella Jira (FLINK-12470) of FLIP39 > Create a new set of parameters > -- > > Key: FLINK-11111 > URL: https://issues.apache.org/jira/browse/FLINK-11111 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Weihua Jiang >Priority: Major > > One goal of the new Table based ML Pipeline is to be easy for tooling. That is, any > ML/AI algorithm adapting to this ML Pipeline standard shall declare all its > parameters via a well-defined interface, so that the AI platform can > uniformly get/set the corresponding parameters while staying agnostic about the details > of the specific algorithm. The only difference between algorithms, from a user's > perspective, is the name. All the other algorithm parameters are self-descriptive. > > This will also be useful for future Flink ML SQL, as the SQL parser can > uniformly handle all these parameter concerns. This can greatly simplify the > SQL parser. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11112) Support pipeline import/export
[ https://issues.apache.org/jira/browse/FLINK-11112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-11112. - Resolution: Duplicate We decided to move the entire ml pipeline and ml lib development under the umbrella Jira (FLINK-12470) of FLIP39 > Support pipeline import/export > -- > > Key: FLINK-11112 > URL: https://issues.apache.org/jira/browse/FLINK-11112 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Weihua Jiang >Priority: Major > > It is quite common to have one job for training that periodically exports the trained > pipeline and models, and another job that loads these exported pipelines for > inference. > > Thus, we will need functionalities for pipeline import/export. This shall > work in both streaming and batch environments. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
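The export-then-reload workflow this ticket asks for can be sketched as a simple round trip. The FLIP-39 work ultimately persists pipelines as JSON; the stand-in below (all names hypothetical) uses a one-line-per-stage text format so the sketch stays stdlib-only:

```java
import java.util.ArrayList;
import java.util.List;

public class PipelinePersistSketch {
    // One pipeline stage, reduced to a name plus a serialized parameter string.
    static final class StageDesc {
        final String name;
        final String params;

        StageDesc(String name, String params) {
            this.name = name;
            this.params = params;
        }
    }

    // Export each stage as one "name=params" line. A training job would write
    // this string out alongside the fitted models.
    static String exportPipeline(List<StageDesc> stages) {
        StringBuilder sb = new StringBuilder();
        for (StageDesc s : stages) {
            sb.append(s.name).append('=').append(s.params).append('\n');
        }
        return sb.toString();
    }

    // Import rebuilds the stage list; an inference job (streaming or batch)
    // would call this to reconstruct the pipeline before applying it.
    static List<StageDesc> importPipeline(String text) {
        List<StageDesc> stages = new ArrayList<>();
        for (String line : text.split("\n")) {
            if (line.isEmpty()) {
                continue;
            }
            int eq = line.indexOf('=');
            stages.add(new StageDesc(line.substring(0, eq), line.substring(eq + 1)));
        }
        return stages;
    }
}
```

The essential property is that import(export(p)) reproduces the pipeline description, so the same artifact works for both the training and the inference job.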
[jira] [Closed] (FLINK-11109) Create Table based optimizers
[ https://issues.apache.org/jira/browse/FLINK-11109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-11109. - Resolution: Duplicate We decided to move the entire ml pipeline and ml lib development under the umbrella Jira (FLINK-12470) of FLIP39 > Create Table based optimizers > - > > Key: FLINK-11109 > URL: https://issues.apache.org/jira/browse/FLINK-11109 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Weihua Jiang >Priority: Major > > The existing optimizers in org.apache.flink.ml package are dataset based. > This task is to create a new set of optimizers which are table based. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11108) Create a new set of table based ML Pipeline classes
[ https://issues.apache.org/jira/browse/FLINK-11108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-11108. - Resolution: Duplicate We decided to move the entire ml pipeline and ml lib development under the umbrella Jira (FLINK-12470) of FLIP39 > Create a new set of table based ML Pipeline classes > --- > > Key: FLINK-11108 > URL: https://issues.apache.org/jira/browse/FLINK-11108 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Weihua Jiang >Priority: Major > > The main classes are: > # PipelineStage (the trait for each pipeline stage) > # Estimator (training stage) > # Transformer (the inference/feature engineering stage) > # Pipeline (the whole pipeline) > # Predictor (extends Estimator, for supervised learning) > Detailed design can be referred at parent JIRA's design document. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
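One way the five classes listed above can fit together, sketched with String standing in for Flink's Table and deliberately simplified signatures (illustrative only, not the actual proposed API):

```java
import java.util.ArrayList;
import java.util.List;

public class PipelineClassesSketch {
    // The trait shared by every pipeline stage.
    interface PipelineStage {}

    // Inference / feature-engineering stage: maps an input table to an output table.
    interface Transformer extends PipelineStage {
        String transform(String table);
    }

    // Training stage: consumes a table and produces a fitted Transformer.
    interface Estimator extends PipelineStage {
        Transformer fit(String table);
    }

    // Supervised-learning specialization of Estimator.
    interface Predictor extends Estimator {}

    // A whole pipeline is itself a stage: trainable and then applicable.
    static final class Pipeline implements PipelineStage {
        private final List<PipelineStage> stages = new ArrayList<>();

        Pipeline add(PipelineStage stage) {
            stages.add(stage);
            return this;
        }

        // Fit every Estimator in order, threading the transformed table through,
        // and return a pipeline of pure Transformers ready for inference.
        Pipeline fit(String table) {
            Pipeline fitted = new Pipeline();
            String current = table;
            for (PipelineStage stage : stages) {
                Transformer t = (stage instanceof Estimator)
                        ? ((Estimator) stage).fit(current)
                        : (Transformer) stage;
                current = t.transform(current);
                fitted.add(t);
            }
            return fitted;
        }

        String transform(String table) {
            String current = table;
            for (PipelineStage stage : stages) {
                current = ((Transformer) stage).transform(current);
            }
            return current;
        }
    }
}
```

The key design choice mirrored here is that fitting a pipeline yields another pipeline containing only Transformers, so training and inference share one composition mechanism.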
[jira] [Closed] (FLINK-11096) Create a new table based flink ML package
[ https://issues.apache.org/jira/browse/FLINK-11096?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-11096. - Resolution: Duplicate We decided to move the entire ml pipeline and ml lib development under the umbrella Jira (FLINK-12470) of FLIP39 > Create a new table based flink ML package > - > > Key: FLINK-11096 > URL: https://issues.apache.org/jira/browse/FLINK-11096 > Project: Flink > Issue Type: Sub-task > Components: Library / Machine Learning >Reporter: Weihua Jiang >Priority: Major > > Currently, the DataSet based ML library is under the org.apache._flink.ml_ scala > package and under the _flink-libraries/flink-ml directory._ > > There are two questions related to packaging: > # Shall we create a new scala/java package, e.g. org.apache.flink.table.ml? > Or still stay in org.apache.flink.ml? > # Shall we still put new code in the flink-libraries/flink-ml directory or > create a new one, e.g. flink-libraries/flink-table-ml with a corresponding maven > package? > > I implemented a prototype for the design and found that the new design is > very hard to fit into the existing flink.ml codebase. The existing flink.ml code > is tightly coupled with the DataSet API. Thus, I had to rewrite almost all parts > of flink.ml to get some sample case to work. The only reusable code from > flink.ml is the base math classes under the _org.apache.flink.ml.math_ and > _org.apache.flink.ml.metrics.distance_ packages. > Considering this fact, I would prefer to create a new package > org.apache.flink.table.ml and a new maven package flink-table-ml. > > Please feel free to give your feedback. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-11095) Table based ML Pipeline
[ https://issues.apache.org/jira/browse/FLINK-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16836837#comment-16836837 ] Shaoxuan Wang edited comment on FLINK-11095 at 5/10/19 2:48 AM: We decided to move the entire ml pipeline and ml lib development under the umbrella Jira ([FLINK-12470|https://issues.apache.org/jira/browse/FLINK-12470]) of FLIP39 was (Author: shaoxuanwang): We decided to move the entire ml pipeline and ml lib development under FLIP39 > Table based ML Pipeline > --- > > Key: FLINK-11095 > URL: https://issues.apache.org/jira/browse/FLINK-11095 > Project: Flink > Issue Type: New Feature > Components: Library / Machine Learning >Reporter: Weihua Jiang >Priority: Major > > As the Table API will be the unified high-level API for both batch and streaming, > it is desirable to have a Table API based ML Pipeline definition. > This new table based ML Pipeline shall: > # support unified batch/stream ML/AI functionalities (train/inference). > # be seamlessly integrated with the flink Table based ecosystem. > # provide a base for further flink based AI platform/tooling support. > # provide a base for further flink ML SQL integration. > The initial design is here: > [https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing.] > Based on this design, I made some initial implementation/prototyping. I > will share the code later. > This is the umbrella JIRA. I will create a corresponding sub-jira for each > sub-task. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-11095) Table based ML Pipeline
[ https://issues.apache.org/jira/browse/FLINK-11095?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-11095. - Resolution: Fixed We decided to move the entire ml pipeline and ml lib development under FLIP39 > Table based ML Pipeline > --- > > Key: FLINK-11095 > URL: https://issues.apache.org/jira/browse/FLINK-11095 > Project: Flink > Issue Type: New Feature > Components: Library / Machine Learning >Reporter: Weihua Jiang >Priority: Major > > As the Table API will be the unified high-level API for both batch and streaming, > it is desirable to have a Table API based ML Pipeline definition. > This new table based ML Pipeline shall: > # support unified batch/stream ML/AI functionalities (train/inference). > # be seamlessly integrated with the flink Table based ecosystem. > # provide a base for further flink based AI platform/tooling support. > # provide a base for further flink ML SQL integration. > The initial design is here: > [https://docs.google.com/document/d/1PLddLEMP_wn4xHwi6069f3vZL7LzkaP0MN9nAB63X90/edit?usp=sharing.] > Based on this design, I made some initial implementation/prototyping. I > will share the code later. > This is the umbrella JIRA. I will create a corresponding sub-jira for each > sub-task. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-12470) FLIP39: Flink ML pipeline and ML libs
Shaoxuan Wang created FLINK-12470: - Summary: FLIP39: Flink ML pipeline and ML libs Key: FLINK-12470 URL: https://issues.apache.org/jira/browse/FLINK-12470 Project: Flink Issue Type: New Feature Components: Library / Machine Learning Affects Versions: 1.9.0 Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang Fix For: 1.9.0 This is the umbrella Jira for FLIP39, which intends to enhance the scalability and the ease of use of Flink ML. ML Discussion thread: [http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-39-Flink-ML-pipeline-and-ML-libs-td28633.html] Google Doc: (will be converted to an official Confluence page very soon) [https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo|https://docs.google.com/document/d/1StObo1DLp8iiy0rbukx8kwAJb0BwDZrQrMWub3DzsEo/edit] In machine learning, there are mainly two types of people. The first type is MLlib developers. They need a set of standard, well-abstracted core ML APIs to implement the algorithms. Every ML algorithm is a concrete implementation on top of these APIs. The second type is MLlib users, who utilize the existing/packaged MLlib to train or serve a model. It is pretty common that the entire training or inference is constructed from a sequence of transformations or algorithms. It is essential to provide a workflow/pipeline API for MLlib users such that they can easily combine multiple algorithms to describe the ML workflow/pipeline. Flink currently has a set of ML core interfaces, but they are built on top of the DataSet API. This does not quite align with the latest Flink [roadmap|https://flink.apache.org/roadmap.html] (the Table API will become the first-class citizen and primary API for analytics use cases, while the DataSet API will be gradually deprecated). Moreover, Flink at present does not have any interface that allows MLlib users to describe an ML workflow/pipeline, nor does it provide any approach to persist pipelines or models and reuse them in the future. 
To solve/improve these issues, in this FLIP we propose to: * Provide a new set of ML core interfaces (on top of the Flink Table API) * Provide an ML pipeline interface (on top of the Flink Table API) * Provide interfaces for parameter management and pipeline persistence * All the above interfaces should facilitate any new ML algorithm. We will gradually add various standard ML algorithms on top of these newly proposed interfaces to ensure their feasibility and scalability. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
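The proposed pipeline concept can be illustrated with a plain-Java sketch. The names `Transformer` and `Pipeline` follow the FLIP-39 proposal, but the signatures here are simplified assumptions for illustration only: a `List<Double>` stands in for a Flink `Table`, and the real interfaces are defined on top of the Table API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in for the proposed FLIP-39 Transformer interface
// (the real one operates on Flink Tables, not List<Double>).
interface Transformer {
    List<Double> transform(List<Double> data);
}

// A pipeline is an ordered sequence of transformations, itself usable
// as a single Transformer.
class Pipeline implements Transformer {
    private final List<Transformer> stages = new ArrayList<>();

    Pipeline appendStage(Transformer t) {
        stages.add(t);
        return this;
    }

    @Override
    public List<Double> transform(List<Double> data) {
        List<Double> result = data;
        for (Transformer t : stages) {
            result = t.transform(result);
        }
        return result;
    }
}

public class PipelineSketch {
    public static void main(String[] args) {
        Transformer shift = in -> {
            List<Double> out = new ArrayList<>();
            for (double v : in) out.add(v + 1.0);
            return out;
        };
        Transformer scale = in -> {
            List<Double> out = new ArrayList<>();
            for (double v : in) out.add(v * 2.0);
            return out;
        };
        Pipeline p = new Pipeline().appendStage(shift).appendStage(scale);
        System.out.println(p.transform(Arrays.asList(1.0, 2.0)));  // [4.0, 6.0]
    }
}
```

Chaining stages this way is what lets MLlib users describe a whole training or inference workflow as a single object that can later be persisted and reused.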
[jira] [Closed] (FLINK-11004) Wrong ProcessWindowFunction.process argument in example of Incremental Window Aggregation with ReduceFunction
[ https://issues.apache.org/jira/browse/FLINK-11004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-11004. - Resolution: Fixed Fixed and merged into master > Wrong ProcessWindowFunction.process argument in example of Incremental Window > Aggregation with ReduceFunction > - > > Key: FLINK-11004 > URL: https://issues.apache.org/jira/browse/FLINK-11004 > Project: Flink > Issue Type: Bug > Components: Documentation >Affects Versions: 1.4.2, 1.5.5, 1.6.2 >Reporter: Yuanyang Wu >Priority: Major > Labels: pull-request-available > Original Estimate: 5m > Remaining Estimate: 5m > > [https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/operators/windows.html#incremental-window-aggregation-with-reducefunction] > The example uses the wrong "window" argument in process() > {{JAVA example}} > {{- out.collect(new Tuple2<Long, SensorReading>({color:#d04437}window{color}.getStart(), min));}} > {{+ out.collect(new Tuple2<Long, SensorReading>({color:#14892c}context.window(){color}.getStart(), min));}} > > {{In the Scala example, the 2nd argument should be context: Context instead of window}} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-5905) Add user-defined aggregation functions to documentation.
[ https://issues.apache.org/jira/browse/FLINK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16127439#comment-16127439 ] Shaoxuan Wang commented on FLINK-5905: -- This issue will be solved once FLINK-6751 is merged > Add user-defined aggregation functions to documentation. > > > Key: FLINK-5905 > URL: https://issues.apache.org/jira/browse/FLINK-5905 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Fabian Hueske >Assignee: Shaoxuan Wang > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (FLINK-7194) Add getResultType and getAccumulatorType to AggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092983#comment-16092983 ] Shaoxuan Wang edited comment on FLINK-7194 at 7/19/17 12:15 PM: [~fhueske], I am ok with your proposal for the changes to the {{AggregateFunction}} was (Author: shaoxuanwang): [~fhueske], I am ok with your proposal for the changes to the AggregateFunction > Add getResultType and getAccumulatorType to AggregateFunction > - > > Key: FLINK-7194 > URL: https://issues.apache.org/jira/browse/FLINK-7194 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Affects Versions: 1.4.0 >Reporter: Fabian Hueske > > FLINK-6725 and FLINK-6457 proposed to remove methods with default > implementations such as {{getResultType()}}, {{toString()}}, or > {{requiresOver()}} from the base classes of user-defined methods (UDF, UDTF, > UDAGG) and instead offer them as contract methods which are dynamically > In PR [#3993|https://github.com/apache/flink/pull/3993] I argued that these > methods have a fixed signature (in contrast to the {{eval()}}, > {{accumulate()}} and {{retract()}} methods) and should be kept in the > classes. For users that don't need these methods, this doesn't make a > difference because the methods are not abstract and have a default > implementation. For users that need to override the methods it makes a > difference, because they get IDE and compiler support when overriding them > and they cannot get the signature wrong. > Consequently, I propose to add {{getResultType()}} and > {{getAccumulatorType()}} as methods with default implementations to > {{AggregateFunction}}. This will make the interface of {{AggregateFunction}} > more consistent with {{ScalarFunction}} and {{TableFunction}}. > What do you think [~shaoxuan], [~RuidongLi] and [~jark]? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7194) Add getResultType and getAccumulatorType to AggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-7194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092983#comment-16092983 ] Shaoxuan Wang commented on FLINK-7194: -- [~fhueske], I am ok with your proposal for the changes to the AggregateFunction > Add getResultType and getAccumulatorType to AggregateFunction > - > > Key: FLINK-7194 > URL: https://issues.apache.org/jira/browse/FLINK-7194 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Affects Versions: 1.4.0 >Reporter: Fabian Hueske > > FLINK-6725 and FLINK-6457 proposed to remove methods with default > implementations such as {{getResultType()}}, {{toString()}}, or > {{requiresOver()}} from the base classes of user-defined methods (UDF, UDTF, > UDAGG) and instead offer them as contract methods which are dynamically > In PR [#3993|https://github.com/apache/flink/pull/3993] I argued that these > methods have a fixed signature (in contrast to the {{eval()}}, > {{accumulate()}} and {{retract()}} methods) and should be kept in the > classes. For users that don't need these methods, this doesn't make a > difference because the methods are not abstract and have a default > implementation. For users that need to override the methods it makes a > difference, because they get IDE and compiler support when overriding them > and they cannot get the signature wrong. > Consequently, I propose to add {{getResultType()}} and > {{getAccumulatorType()}} as methods with default implementations to > {{AggregateFunction}}. This will make the interface of {{AggregateFunction}} > more consistent with {{ScalarFunction}} and {{TableFunction}}. > What do you think [~shaoxuan], [~RuidongLi] and [~jark]? -- This message was sent by Atlassian JIRA (v6.4.14#64029)
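The trade-off discussed in the issue (fixed-signature methods kept in the base class with a default implementation, so overriding them gets IDE and compiler support) can be sketched in plain Java. `TypeInfo` and the class names here are hypothetical stand-ins, not Flink's actual `TypeInformation` or `AggregateFunction`:

```java
// Hypothetical stand-in for Flink's TypeInformation.
class TypeInfo {
    final String name;
    TypeInfo(String name) { this.name = name; }
}

// The base class keeps getResultType() with a default implementation:
// users who don't need it can ignore it; users who override it get
// @Override signature checking from the compiler.
abstract class AggregateFunctionBase {
    public TypeInfo getResultType() {
        return null;  // null here means "derive the type automatically"
    }
}

class MyAggregate extends AggregateFunctionBase {
    @Override  // compiler verifies the signature matches the base class
    public TypeInfo getResultType() {
        return new TypeInfo("DOUBLE");
    }
}

public class ContractMethodSketch {
    public static void main(String[] args) {
        AggregateFunctionBase f = new MyAggregate();
        System.out.println(f.getResultType().name);  // DOUBLE
    }
}
```

By contrast, a pure contract method (no declaration in the base class, resolved by reflection) offers no such compile-time check, which is the point Fabian's comment makes about `eval()`-style methods versus fixed-signature ones.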
[jira] [Commented] (FLINK-6751) Table API / SQL Docs: UDFs Page
[ https://issues.apache.org/jira/browse/FLINK-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092970#comment-16092970 ] Shaoxuan Wang commented on FLINK-6751: -- Sorry for the lack of updates. I will work on this in the next 3-4 days. > Table API / SQL Docs: UDFs Page > --- > > Key: FLINK-6751 > URL: https://issues.apache.org/jira/browse/FLINK-6751 > Project: Flink > Issue Type: Task > Components: Documentation, Table API & SQL >Affects Versions: 1.3.0 >Reporter: Fabian Hueske >Assignee: Shaoxuan Wang > > Update and extend the documentation of UDFs in the Table API / SQL: > {{./docs/dev/table/udfs.md}} > Missing sections: > - Registration of UDFs > - UDAGGs -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Closed] (FLINK-6725) make requiresOver as a contracted method in udagg
[ https://issues.apache.org/jira/browse/FLINK-6725?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-6725. Resolution: Won't Fix > make requiresOver as a contracted method in udagg > - > > Key: FLINK-6725 > URL: https://issues.apache.org/jira/browse/FLINK-6725 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > I realized requiresOver is defined in the udagg interface when I wrote up the > udagg doc. I would like to make requiresOver a contract method. This makes > the entire udagg interface consistent and clean. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (FLINK-5878) Add stream-stream inner/left-out join
[ https://issues.apache.org/jira/browse/FLINK-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-5878: - Summary: Add stream-stream inner/left-out join (was: Add stream-stream inner join on TableAPI) > Add stream-stream inner/left-out join > - > > Key: FLINK-5878 > URL: https://issues.apache.org/jira/browse/FLINK-5878 > Project: Flink > Issue Type: New Feature > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > This task is intended to support stream-stream inner join on tableAPI. A > brief design doc is created: > https://docs.google.com/document/d/10oJDw-P9fImD5tc3Stwr7aGIhvem4l8uB2bjLzK72u4/edit > We propose to use the mapState as the backend state interface for this "join" > operator, so this task requires FLINK-4856. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (FLINK-5878) Add stream-stream inner/left-out join
[ https://issues.apache.org/jira/browse/FLINK-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-5878: Assignee: Hequn Cheng (was: Shaoxuan Wang) > Add stream-stream inner/left-out join > - > > Key: FLINK-5878 > URL: https://issues.apache.org/jira/browse/FLINK-5878 > Project: Flink > Issue Type: New Feature > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Hequn Cheng > > This task is intended to support stream-stream inner join on tableAPI. A > brief design doc is created: > https://docs.google.com/document/d/10oJDw-P9fImD5tc3Stwr7aGIhvem4l8uB2bjLzK72u4/edit > We propose to use the mapState as the backend state interface for this "join" > operator, so this task requires FLINK-4856. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (FLINK-7146) FLINK SQLs support DDL
[ https://issues.apache.org/jira/browse/FLINK-7146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16083614#comment-16083614 ] Shaoxuan Wang commented on FLINK-7146: -- [~yuemeng], DDL is an important interface (it is a customer-facing interface; once released, it is not easy to revise) which we have been working on and polishing for a couple of months. We have been hesitant to propose something, as there are still several design details under debate. We have opened FLINK-6962 for a similar purpose and will post a detailed design doc very soon. Thanks. > FLINK SQLs support DDL > -- > > Key: FLINK-7146 > URL: https://issues.apache.org/jira/browse/FLINK-7146 > Project: Flink > Issue Type: New Feature > Components: Table API & SQL >Reporter: yuemeng > > For now, Flink SQL can't support DDL; we can only register a table by calling > registerTableInternal in TableEnvironment. > We should support DDL in SQL, such as creating a table or a function, like: > {code} > CREATE TABLE kafka_source ( > id INT, > price INT > ) PROPERTIES ( > category = 'source', > type = 'kafka', > version = '0.9.0.1', > separator = ',', > topic = 'test', > brokers = 'xx:9092', > group_id = 'test' > ); > CREATE TABLE db_sink ( > id INT, > price DOUBLE > ) PROPERTIES ( > category = 'sink', > type = 'mysql', > table_name = 'udaf_test', > url = > 'jdbc:mysql://127.0.0.1:3308/ds?useUnicode=true&characterEncoding=UTF8', > username = 'ds_dev', > password = 's]k51_(>R' > ); > CREATE TEMPORARY function 'AVGUDAF' AS > 'com.x.server.codegen.aggregate.udaf.avg.IntegerAvgUDAF'; > INSERT INTO db_sink SELECT id ,AVGUDAF(price) FROM kafka_source group by id > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-6962) SQL DDL for input and output tables
Shaoxuan Wang created FLINK-6962: Summary: SQL DDL for input and output tables Key: FLINK-6962 URL: https://issues.apache.org/jira/browse/FLINK-6962 Project: Flink Issue Type: New Feature Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: lincoln.lee Fix For: 1.4.0 This Jira adds support to allow users to define the DDL for source and sink tables, including the watermark (on the source table) and the emit SLA (on the result table). The detailed design doc will be attached soon. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (FLINK-6963) User Defined Operator
Shaoxuan Wang created FLINK-6963: Summary: User Defined Operator Key: FLINK-6963 URL: https://issues.apache.org/jira/browse/FLINK-6963 Project: Flink Issue Type: New Feature Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Jark Wu Fix For: 1.4.0 We are proposing to add a User Defined Operator (UDOP) interface. As opposed to UDF (scalar to scalar), UDTF (scalar to table), and UDAGG (table to scalar), a UDOP allows users to describe business logic for a table-to-table conversion. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
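The table-to-table shape of the proposed UDOP can be contrasted with the other UDF kinds in a plain-Java sketch. Since the actual interface was only being proposed in this Jira, everything here is an illustrative assumption: a table is modeled as a list of rows, each row a list of objects.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Conceptual signatures of the four kinds of user-defined functions:
//   UDF:   scalar -> scalar      UDTF:  scalar -> table
//   UDAGG: table  -> scalar      UDOP:  table  -> table
// Hypothetical UDOP interface, with List<List<Object>> standing in for a Table.
interface UserDefinedOperator {
    List<List<Object>> apply(List<List<Object>> table);
}

public class UdopSketch {
    public static void main(String[] args) {
        // Example UDOP: keep only rows whose first column is positive.
        UserDefinedOperator filterPositive = table -> {
            List<List<Object>> out = new ArrayList<>();
            for (List<Object> row : table) {
                if ((Integer) row.get(0) > 0) out.add(row);
            }
            return out;
        };
        List<List<Object>> in = Arrays.asList(
            Arrays.asList((Object) 1, "a"),
            Arrays.asList((Object) (-2), "b"));
        System.out.println(filterPositive.apply(in).size());  // 1
    }
}
```

The key point is the arity of the signature: unlike a scalar UDF, a UDOP consumes and produces a whole table, so arbitrary row-set logic (filtering, reshaping, joining) fits behind one interface.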
[jira] [Created] (FLINK-6961) Enable configurable early-firing rate
Shaoxuan Wang created FLINK-6961: Summary: Enable configurable early-firing rate Key: FLINK-6961 URL: https://issues.apache.org/jira/browse/FLINK-6961 Project: Flink Issue Type: New Feature Components: Table API & SQL Reporter: Shaoxuan Wang There are cases where we need to emit the result early to allow users to get an observation/sample of the result. Right now, only the unbounded aggregate works in early-firing mode (in the future we will support early firing for all the different scenarios, like windowed aggregate, unbounded/windowed join, etc.). But in the unbounded aggregate, the result is prepared and emitted for each input. This may not be necessary, as users may not need to get the result so frequently in most cases. We create this Jira to track all the efforts (sub-Jiras) to enable a configurable early-firing rate. It should be noted that the early-firing rate will not be exposed to the user; it will be decided automatically by the query optimizer depending on the SLA (allowed latency) of the final result. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
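The idea of decoupling result emission from input arrival can be sketched with a simple counter-based early-firing trigger in plain Java. The `rate` knob here is a hypothetical parameter for illustration; in the proposal the rate would be chosen by the query optimizer from the result SLA, not set directly by the user.

```java
// Emits the running aggregate only every `rate` inputs instead of per record.
class EarlyFiringCount {
    private final int rate;
    private long count = 0;
    private long sinceLastFire = 0;

    EarlyFiringCount(int rate) { this.rate = rate; }

    // Returns the current count when it is time to fire, otherwise null.
    Long accumulate(Object record) {
        count++;
        sinceLastFire++;
        if (sinceLastFire == rate) {
            sinceLastFire = 0;
            return count;
        }
        return null;
    }
}

public class EarlyFiringSketch {
    public static void main(String[] args) {
        EarlyFiringCount agg = new EarlyFiringCount(3);
        int fired = 0;
        for (int i = 0; i < 10; i++) {
            if (agg.accumulate("rec") != null) fired++;
        }
        System.out.println(fired);  // 3 emissions for 10 records at rate 3
    }
}
```

With a rate of 1 this degenerates to the current per-record behavior; a larger rate trades result freshness for fewer downstream updates, which is exactly the SLA/latency trade-off the Jira describes.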
[jira] [Commented] (FLINK-5354) Split up Table API documentation into multiple pages
[ https://issues.apache.org/jira/browse/FLINK-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16027842#comment-16027842 ] Shaoxuan Wang commented on FLINK-5354: -- [~fhueske], thanks for kicking this off. I would like to work on the page for UDFs, as it seems the major missing part is UDAGG. > Split up Table API documentation into multiple pages > - > > Key: FLINK-5354 > URL: https://issues.apache.org/jira/browse/FLINK-5354 > Project: Flink > Issue Type: Improvement > Components: Documentation, Table API & SQL >Reporter: Timo Walther > > The Table API documentation page is quite large at the moment. We should > split it up into multiple pages. > Here is my suggestion: > - Overview (Datatypes, Config, Registering Tables, Examples) > - TableSources and Sinks > - Table API > - SQL > - Functions -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (FLINK-6751) Table API / SQL Docs: UDFs Page
[ https://issues.apache.org/jira/browse/FLINK-6751?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-6751: Assignee: Shaoxuan Wang > Table API / SQL Docs: UDFs Page > --- > > Key: FLINK-6751 > URL: https://issues.apache.org/jira/browse/FLINK-6751 > Project: Flink > Issue Type: Sub-task > Components: Documentation, Table API & SQL >Affects Versions: 1.3.0 >Reporter: Fabian Hueske >Assignee: Shaoxuan Wang > > Update and refine {{./docs/dev/table/udfs.md}} in feature branch > https://github.com/apache/flink/tree/tableDocs -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6725) make requiresOver as a contracted method in udagg
Shaoxuan Wang created FLINK-6725: Summary: make requiresOver as a contracted method in udagg Key: FLINK-6725 URL: https://issues.apache.org/jira/browse/FLINK-6725 Project: Flink Issue Type: Improvement Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang I realized requiresOver is defined in the udagg interface when I wrote up the udagg doc. I would like to make requiresOver a contract method. This makes the entire udagg interface consistent and clean. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (FLINK-6637) Move registerFunction to TableEnvironment
[ https://issues.apache.org/jira/browse/FLINK-6637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-6637: - Component/s: Table API & SQL > Move registerFunction to TableEnvironment > - > > Key: FLINK-6637 > URL: https://issues.apache.org/jira/browse/FLINK-6637 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > We are trying to unify stream and batch. This unification should cover > the tableAPI&SQL query as well as the function registration (as part of DDL). > Currently the registerFunction for UDTF and UDAGG is defined in > BatchTableEnvironment and StreamTableEnvironment separately. We should move > registerFunction to TableEnvironment. > The reason that we did not put registerFunction into TableEnvironment for > UDTF and UDAGG is that we need different registerFunction implementations for Java and Scala > code, as Java needs a special way to generate and pass the implicit value of > typeInfo: > {code:scala} > implicit val typeInfo: TypeInformation[T] = TypeExtractor > .createTypeInfo(tf, classOf[TableFunction[_]], tf.getClass, 0) > .asInstanceOf[TypeInformation[T]] > {code} > It seems that we would need duplicate TableEnvironment classes, one for Java and one > for Scala. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6637) Move registerFunction to TableEnvironment
Shaoxuan Wang created FLINK-6637: Summary: Move registerFunction to TableEnvironment Key: FLINK-6637 URL: https://issues.apache.org/jira/browse/FLINK-6637 Project: Flink Issue Type: Improvement Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang We are trying to unify stream and batch. This unification should cover the tableAPI&SQL query as well as the function registration (as part of DDL). Currently the registerFunction for UDTF and UDAGG is defined in BatchTableEnvironment and StreamTableEnvironment separately. We should move registerFunction to TableEnvironment. The reason that we did not put registerFunction into TableEnvironment for UDTF and UDAGG is that we need different registerFunction implementations for Java and Scala code, as Java needs a special way to generate and pass the implicit value of typeInfo: {code:scala} implicit val typeInfo: TypeInformation[T] = TypeExtractor .createTypeInfo(tf, classOf[TableFunction[_]], tf.getClass, 0) .asInstanceOf[TypeInformation[T]] {code} It seems that we would need duplicate TableEnvironment classes, one for Java and one for Scala. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (FLINK-6544) Expose State Backend Interface for UDAGG
[ https://issues.apache.org/jira/browse/FLINK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-6544: - Summary: Expose State Backend Interface for UDAGG (was: Expose Backend State Interface for UDAGG) > Expose State Backend Interface for UDAGG > > > Key: FLINK-6544 > URL: https://issues.apache.org/jira/browse/FLINK-6544 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Kaibo Zhou >Assignee: Kaibo Zhou > > Currently UDAGG users can not access state, it's necessary to provide users > with a convenient and efficient way to access the state within the UDAGG. > This is the design doc: > https://docs.google.com/document/d/1g-wHOuFj5pMaYMJg90kVIO2IHbiQiO26nWscLIOn50c/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (FLINK-6544) Expose Backend State Interface for UDAGG
[ https://issues.apache.org/jira/browse/FLINK-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-6544: Assignee: Kaibo Zhou > Expose Backend State Interface for UDAGG > > > Key: FLINK-6544 > URL: https://issues.apache.org/jira/browse/FLINK-6544 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Kaibo Zhou >Assignee: Kaibo Zhou > > Currently UDAGG users can not access state, it's necessary to provide users > with a convenient and efficient way to access the state within the UDAGG. > This is the design doc: > https://docs.google.com/document/d/1g-wHOuFj5pMaYMJg90kVIO2IHbiQiO26nWscLIOn50c/edit# -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (FLINK-5433) initiate function of Aggregate does not take effect for DataStream aggregation
[ https://issues.apache.org/jira/browse/FLINK-5433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang closed FLINK-5433. Resolution: Fixed Fixed by UDAGG subtask FLINK-5564 > initiate function of Aggregate does not take effect for DataStream aggregation > -- > > Key: FLINK-5433 > URL: https://issues.apache.org/jira/browse/FLINK-5433 > Project: Flink > Issue Type: Bug > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > The initiate function of Aggregate works for dataset aggregation, but does > not work for DataStream aggregation. > For instance, when giving an initial value, say 2, for CountAggregate, the > result of the dataset aggregate will take this change into account, but the > dataStream aggregate will not. > {code} > class CountAggregate extends Aggregate[Long] { > override def initiate(intermediate: Row): Unit = { > intermediate.setField(countIndex, 2L) > } > } > {code} > The output for the dataset test (testWorkingAggregationDataTypes) will result in > .select('_1.avg, '_2.avg, '_3.avg, '_4.avg, '_5.avg, '_6.avg, '_7.count) > expected: [1,1,1,1,1.5,1.5,2] > received: [1,1,1,1,1.5,1.5,4] (the result of the last count aggregate is bigger > than the expected value by 2, as expected) > But the output for the datastream > test (testProcessingTimeSlidingGroupWindowOverCount) will remain the same: > .select('string, 'int.count, 'int.avg) > Expected :List(Hello world,1,3, Hello world,2,3, Hello,1,2, Hello,2,2, Hi,1,1) > Actual :MutableList(Hello world,1,3, Hello world,2,3, Hello,1,2, Hello,2,2, > Hi,1,1) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (FLINK-5915) Add support for the aggregate on multi fields
[ https://issues.apache.org/jira/browse/FLINK-5915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang resolved FLINK-5915. -- Resolution: Fixed This is completely resolved with FLINK-5906. > Add support for the aggregate on multi fields > - > > Key: FLINK-5915 > URL: https://issues.apache.org/jira/browse/FLINK-5915 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > Fix For: 1.3.0 > > > Some UDAGGs have multiple fields as input. For instance, > table > .window(Tumble over 10.minutes on 'rowtime as 'w ) > .groupBy('key, 'w) > .select('key, weightedAvg('value, 'weight)) > This task will add support for aggregates on multiple fields. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
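A multi-field aggregate such as `weightedAvg('value, 'weight)` can be sketched as a plain-Java accumulator. The `accumulate`/`getValue` shape mirrors the UDAGG design discussed in these issues, but this class is a simplified illustration, not the actual Flink interface:

```java
// Accumulator for weightedAvg(value, weight): two input fields per invocation,
// which a single-column aggregate interface cannot express.
class WeightedAvgAccumulator {
    private double weightedSum = 0.0;
    private double weightSum = 0.0;

    void accumulate(double value, double weight) {
        weightedSum += value * weight;
        weightSum += weight;
    }

    double getValue() {
        return weightSum == 0.0 ? 0.0 : weightedSum / weightSum;
    }
}

public class WeightedAvgSketch {
    public static void main(String[] args) {
        WeightedAvgAccumulator acc = new WeightedAvgAccumulator();
        acc.accumulate(10.0, 1.0);
        acc.accumulate(20.0, 3.0);
        System.out.println(acc.getValue());  // (10*1 + 20*3) / 4 = 17.5
    }
}
```

This is why the `accumulate` method needs a variable argument list in the real interface: the set of input columns is decided by the user's query, not fixed by the framework.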
[jira] [Assigned] (FLINK-5905) Add user-defined aggregation functions to documentation.
[ https://issues.apache.org/jira/browse/FLINK-5905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-5905: Assignee: Shaoxuan Wang > Add user-defined aggregation functions to documentation. > > > Key: FLINK-5905 > URL: https://issues.apache.org/jira/browse/FLINK-5905 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Fabian Hueske >Assignee: Shaoxuan Wang > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (FLINK-5906) Add support to register UDAGG in Table and SQL API
[ https://issues.apache.org/jira/browse/FLINK-5906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-5906: - Summary: Add support to register UDAGG in Table and SQL API (was: Add support to register UDAGGs in TableEnvironment) > Add support to register UDAGG in Table and SQL API > -- > > Key: FLINK-5906 > URL: https://issues.apache.org/jira/browse/FLINK-5906 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Fabian Hueske >Assignee: Shaoxuan Wang > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6361) Finalize the AggregateFunction interface and refactoring built-in aggregates
Shaoxuan Wang created FLINK-6361: Summary: Finalize the AggregateFunction interface and refactoring built-in aggregates Key: FLINK-6361 URL: https://issues.apache.org/jira/browse/FLINK-6361 Project: Flink Issue Type: Sub-task Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang We have completed codeGen for all aggregate runtime functions. Now we can finalize the AggregateFunction. This includes: 1) removing the Accumulator trait; 2) removing the accumulate, retract, merge, resetAccumulator, and getAccumulatorType methods from the interface and allowing them as contracted methods for UDAGG; 3) refactoring the built-in aggregates accordingly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (FLINK-5813) code generation for user-defined aggregate functions
[ https://issues.apache.org/jira/browse/FLINK-5813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang resolved FLINK-5813. -- Resolution: Fixed All required issues are solved. > code generation for user-defined aggregate functions > > > Key: FLINK-5813 > URL: https://issues.apache.org/jira/browse/FLINK-5813 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > The input and return types of the new proposed UDAGG functions are > dynamically given by the users. All these user defined functions have to be > generated via codegen. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (FLINK-6334) Refactoring UDTF interface
[ https://issues.apache.org/jira/browse/FLINK-6334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-6334: - Description: The current UDTF leverages the table.join(expression) interface, which is not a proper interface in terms of semantics. We would like to refactor this to let UDTF use table.join(table) interface. Very briefly, UDTF's apply method will return a Table Type, so Join(UDTF('a, 'b, ...) as 'c) shall be viewed as join(Table) (was: UDTF's apply method returns a Table Type, so Join(UDTF('a, 'b, ...) as 'c) shall be viewed as join(Table)) > Refactoring UDTF interface > -- > > Key: FLINK-6334 > URL: https://issues.apache.org/jira/browse/FLINK-6334 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Ruidong Li > > The current UDTF leverages the table.join(expression) interface, which is not > a proper interface in terms of semantics. We would like to refactor this to > let UDTF use table.join(table) interface. Very briefly, UDTF's apply method > will return a Table Type, so Join(UDTF('a, 'b, ...) as 'c) shall be viewed as > join(Table) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-5564) User Defined Aggregates
[ https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952127#comment-15952127 ] Shaoxuan Wang commented on FLINK-5564: -- [~fhueske], implementing the entire codeGen in one PR results in thousands of lines of code changes; it is hard to review. I split FLINK-5813 into three sub-tasks. > User Defined Aggregates > --- > > Key: FLINK-5564 > URL: https://issues.apache.org/jira/browse/FLINK-5564 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > User-defined aggregates would be a great addition to the Table API / SQL. > The current aggregate interface is not well suited for external users. > This issue proposes to redesign the aggregate such that we can expose a > better external UDAGG interface to the users. The detailed design proposal > can be found here: > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit > Motivation: > 1. The current aggregate interface is not very concise for users. One > needs to know the design details of the intermediate Row buffer before > implementing an Aggregate. Seven functions are needed even for a simple Count > aggregate. > 2. Another limitation of the current aggregate function is that it can only be > applied to one single column. There are many scenarios which require the > aggregate function to take multiple columns as input. > 3. “Retraction” is not considered and covered in the current Aggregate. > 4. It would be very good to have a local/global aggregate query plan > optimization, which is very promising for optimizing UDAGG performance in some > scenarios. > Proposed Changes: > 1. Implement an aggregate dataStream API (Done by > [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582]) > 2. Update all the existing aggregates to use the new aggregate dataStream API > 3. Provide a better User-Defined Aggregate interface > 4. Add retraction support > 5. Add local/global aggregate -- This message was sent by Atlassian JIRA (v6.3.15#6346)
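The first motivation point (a simple Count should not require seven methods) can be illustrated with a minimal accumulator-style sketch in plain Java. The `createAccumulator`/`accumulate`/`retract`/`getValue` names follow the redesign discussed in the linked design doc, but this class itself is illustrative, not the Flink interface:

```java
// The redesigned UDAGG shape: an explicit accumulator plus a few small
// methods, instead of the seven methods the old Aggregate interface required.
class CountAggregate {
    static class CountAcc { long count = 0; }

    CountAcc createAccumulator() { return new CountAcc(); }

    void accumulate(CountAcc acc, Object value) { acc.count++; }

    // Retraction support from the proposal: undo one previous input.
    void retract(CountAcc acc, Object value) { acc.count--; }

    long getValue(CountAcc acc) { return acc.count; }
}

public class CountSketch {
    public static void main(String[] args) {
        CountAggregate agg = new CountAggregate();
        CountAggregate.CountAcc acc = agg.createAccumulator();
        agg.accumulate(acc, "a");
        agg.accumulate(acc, "b");
        agg.retract(acc, "a");
        System.out.println(agg.getValue(acc));  // 1
    }
}
```

Keeping the accumulator as an explicit user-visible type (rather than an internal Row buffer) is what makes the interface concise: the user only describes how one record updates the accumulator.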
[jira] [Created] (FLINK-6242) codeGen DataSet Goupingwindow Aggregates
Shaoxuan Wang created FLINK-6242: Summary: codeGen DataSet Goupingwindow Aggregates Key: FLINK-6242 URL: https://issues.apache.org/jira/browse/FLINK-6242 Project: Flink Issue Type: Sub-task Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6241) codeGen dataStream aggregates that use ProcessFunction
Shaoxuan Wang created FLINK-6241: Summary: codeGen dataStream aggregates that use ProcessFunction Key: FLINK-6241 URL: https://issues.apache.org/jira/browse/FLINK-6241 Project: Flink Issue Type: Sub-task Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6240) codeGen dataStream aggregates that use AggregateAggFunction
Shaoxuan Wang created FLINK-6240: Summary: codeGen dataStream aggregates that use AggregateAggFunction Key: FLINK-6240 URL: https://issues.apache.org/jira/browse/FLINK-6240 Project: Flink Issue Type: Sub-task Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6216) DataStream unbounded groupby aggregate with early firing
Shaoxuan Wang created FLINK-6216: Summary: DataStream unbounded groupby aggregate with early firing Key: FLINK-6216 URL: https://issues.apache.org/jira/browse/FLINK-6216 Project: Flink Issue Type: New Feature Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang A groupby aggregate results in a replace table. For an infinite groupby aggregate, we need a mechanism to define when the data should be emitted (early-fired). This task aims to implement the initial version of the unbounded groupby aggregate, where we update and emit the aggregate value for each arriving record. In the future, we will implement the mechanism and interface to let users define the frequency/period of early-firing the unbounded groupby aggregation results. The limited space of the backend state is one of the major obstacles to supporting unbounded groupby aggregates in practice. For this reason, we suggest two common (and very useful) use-cases of this unbounded groupby aggregate: 1. The range of the grouping key is limited. In this case, a newly arrived record will either be inserted into the state as a new record or replace the existing record in the backend state. The data in the backend state will not be evicted if the resources are properly provisioned by the user, such that we can guarantee the correctness of the aggregation results. 2. When the grouping key is unlimited, we will not be able to ensure 100% correctness of the "unbounded groupby aggregate". In this case, we will rely on the TTL mechanism of the RocksDB backend state to evict old data, such that we can guarantee correct results within a certain time range. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
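The initial early-firing behavior described above (update the keyed state and emit the refined aggregate for every arriving record) can be sketched in plain Java, outside of any Flink API; the class and its in-memory map standing in for the keyed backend state are purely illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch (not Flink code) of the initial early-firing mode:
// the keyed state is updated and the new aggregate value is emitted for
// each arriving record.
public class UnboundedGroupBySketch {
    private final Map<String, Long> state = new HashMap<>(); // stand-in for keyed backend state
    public final List<String> emitted = new ArrayList<>();   // stand-in for the output stream

    public void process(String key, long value) {
        // Insert as a new record or replace the existing one for this key.
        long sum = state.merge(key, value, Long::sum);
        // Early fire: emit the refined aggregate immediately.
        emitted.add(key + "=" + sum);
    }

    public static void main(String[] args) {
        UnboundedGroupBySketch agg = new UnboundedGroupBySketch();
        agg.process("a", 1);
        agg.process("a", 2);
        agg.process("b", 5);
        System.out.println(agg.emitted); // [a=1, a=3, b=5]
    }
}
```

Note how key "a" is emitted twice: the second emission replaces the first, which is exactly why the downstream needs either a replace-table semantics or the retraction mechanism of FLINK-6047.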
[jira] [Assigned] (FLINK-6091) Implement and turn on the retraction for grouping window aggregate
[ https://issues.apache.org/jira/browse/FLINK-6091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-6091: Assignee: Hequn Cheng (was: Shaoxuan Wang) > Implement and turn on the retraction for grouping window aggregate > -- > > Key: FLINK-6091 > URL: https://issues.apache.org/jira/browse/FLINK-6091 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Hequn Cheng > > Implement the functions for processing retract messages for the grouping window > aggregate. No retract-generating function is needed for now, as the current > grouping window aggregates are all executed in “without early firing” mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (FLINK-6090) Implement optimizer for retraction and turn on retraction for over window aggregate
[ https://issues.apache.org/jira/browse/FLINK-6090?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-6090: Assignee: Hequn Cheng (was: Shaoxuan Wang) > Implement optimizer for retraction and turn on retraction for over window > aggregate > --- > > Key: FLINK-6090 > URL: https://issues.apache.org/jira/browse/FLINK-6090 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Hequn Cheng > > Implement the optimizer for retraction and turn on retraction for the over window > aggregate as the first prototype example: > 1. Add a RetractionRule at the decoration stage, which can derive the > replace table/append table and NeedRetraction properties. > 2. Match NeedRetraction against the replace table and mark the accumulating mode; > add the necessary retract-generating function at the replace table, and add the > retract-processing logic at the retract consumer. > 3. Turn on retraction for the over window aggregate. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (FLINK-6089) Implement decoration phase for rewriting predicated logical plan after volcano optimization phase
[ https://issues.apache.org/jira/browse/FLINK-6089?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-6089: Assignee: Hequn Cheng (was: Shaoxuan Wang) > Implement decoration phase for rewriting predicated logical plan after > volcano optimization phase > - > > Key: FLINK-6089 > URL: https://issues.apache.org/jira/browse/FLINK-6089 > Project: Flink > Issue Type: Sub-task >Reporter: Shaoxuan Wang >Assignee: Hequn Cheng > > At present, there is no chance to modify the DataStreamRel tree after the > volcano optimization. We propose to add a decoration phase after the volcano > optimization phase. The decoration phase is dedicated to rewriting the predicated > logical plan and is independent of the cost module. Once the decoration phase is > added, we get the chance to apply retraction rules at this phase. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (FLINK-6047) Master Jira for "Retraction for Flink Streaming"
[ https://issues.apache.org/jira/browse/FLINK-6047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-6047: - Description: [Design doc]: https://docs.google.com/document/d/18XlGPcfsGbnPSApRipJDLPg5IFNGTQjnz7emkVpZlkw [Introduction]: "Retraction" is an important building block for data streaming that refines early fired results. “Early firing” is very common and widely used in many streaming scenarios, for instance “window-less” or unbounded aggregates and stream-stream inner joins, and windowed (with early firing) aggregates and stream-stream inner joins. There are mainly two cases that require retractions: 1) updates on a keyed table (the key is either a primaryKey (PK) on the source table, or a groupKey/partitionKey in an aggregate); 2) when dynamic windows (e.g., session windows) are in use, the new value may replace more than one previous window due to window merging. To the best of our knowledge, the retraction of early fired streaming results has never been practically solved before. In this proposal, we develop a retraction solution and explain how it works for the problem of “updates on a keyed table”. The same solution can easily be extended to dynamic window merging, as the key component of retraction - how to refine an early fired result - is the same across different problems. 
[Proposed Jiras]: Implement decoration phase for rewriting predicated logical plan after volcano optimization phase Implement optimizer for retraction and turn on retraction for over window aggregate Implement and turn on the retraction for grouping window aggregate Implement and turn on retraction for table source Implement and turn on retraction for table sink Implement and turn on retraction for stream-stream inner join Implement the retraction for the early firing window Implement the retraction for the dynamic window with early firing was: [Design doc]: https://docs.google.com/document/d/18XlGPcfsGbnPSApRipJDLPg5IFNGTQjnz7emkVpZlkw [Introduction]: "Retraction" is an important building block for data streaming to refine the early fired results in streaming. “Early firing” are very common and widely used in many streaming scenarios, for instance “window-less” or unbounded aggregate and stream-stream inner join, windowed (with early firing) aggregate and stream-stream inner join. There are mainly two cases that require retractions: 1) update on the keyed table (the key is either a primaryKey (PK) on source table, or a groupKey/partitionKey in an aggregate); 2) When dynamic windows (e.g., session window) are in use, the new value may be replacing more than one previous window due to window merging. To the best of our knowledge, the retraction for the early fired streaming results has never been practically solved before. In this proposal, we develop a retraction solution and explain how it works for the problem of “update on the keyed table”. The same solution can be easily extended for the dynamic windows merging, as the key component of retraction - how to refine an early fired results - is the same across different problems. 
[Proposed Jiras]: Implement decoration phase for predicated logical plan rewriting after volcano optimization phase Add source with table primary key and replace table property Add sink tableInsert and NeedRetract property Implement the retraction for partitioned unbounded over window aggregate Implement the retraction for stream-stream inner join Implement the retraction for the early firing window Implement the retraction for the dynamic window with early firing > Master Jira for "Retraction for Flink Streaming" > > > Key: FLINK-6047 > URL: https://issues.apache.org/jira/browse/FLINK-6047 > Project: Flink > Issue Type: New Feature >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > [Design doc]: > https://docs.google.com/document/d/18XlGPcfsGbnPSApRipJDLPg5IFNGTQjnz7emkVpZlkw > [Introduction]: > "Retraction" is an important building block for data streaming to refine the > early fired results in streaming. “Early firing” are very common and widely > used in many streaming scenarios, for instance “window-less” or unbounded > aggregate and stream-stream inner join, windowed (with early firing) > aggregate and stream-stream inner join. There are mainly two cases that > require retractions: 1) update on the keyed table (the key is either a > primaryKey (PK) on source table, or a groupKey/partitionKey in an aggregate); > 2) When dynamic windows (e.g., session window) are in use, the new value may > be replacing more than one previous window due to window merging. > To the best of our knowledge, the retraction for the early fired streaming > results has never been practically solved before. In this proposal
[jira] [Created] (FLINK-6094) Implement and turn on retraction for stream-stream inner join
Shaoxuan Wang created FLINK-6094: Summary: Implement and turn on retraction for stream-stream inner join Key: FLINK-6094 URL: https://issues.apache.org/jira/browse/FLINK-6094 Project: Flink Issue Type: Sub-task Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang This includes: modifying the RetractionRule to consider the stream-stream inner join, and implementing the retract-processing logic for the join. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6093) Implement and turn on retraction for table sink
Shaoxuan Wang created FLINK-6093: Summary: Implement and turn on retraction for table sink Key: FLINK-6093 URL: https://issues.apache.org/jira/browse/FLINK-6093 Project: Flink Issue Type: Sub-task Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang Add sink tableInsert and NeedRetract property, and consider table sink in optimizer RetractionRule -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6092) Implement and turn on retraction for table source
Shaoxuan Wang created FLINK-6092: Summary: Implement and turn on retraction for table source Key: FLINK-6092 URL: https://issues.apache.org/jira/browse/FLINK-6092 Project: Flink Issue Type: Sub-task Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang Add the Primary Key and replace/append properties for table source, and consider table source in optimizer RetractionRule -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6091) Implement and turn on the retraction for grouping window aggregate
Shaoxuan Wang created FLINK-6091: Summary: Implement and turn on the retraction for grouping window aggregate Key: FLINK-6091 URL: https://issues.apache.org/jira/browse/FLINK-6091 Project: Flink Issue Type: Sub-task Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang Implement the functions for processing retract messages for the grouping window aggregate. No retract-generating function is needed for now, as the current grouping window aggregates are all executed in “without early firing” mode. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6090) Implement optimizer for retraction and turn on retraction for over window aggregate
Shaoxuan Wang created FLINK-6090: Summary: Implement optimizer for retraction and turn on retraction for over window aggregate Key: FLINK-6090 URL: https://issues.apache.org/jira/browse/FLINK-6090 Project: Flink Issue Type: Sub-task Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang Implement the optimizer for retraction and turn on retraction for the over window aggregate as the first prototype example: 1. Add a RetractionRule at the decoration stage, which can derive the replace table/append table and NeedRetraction properties. 2. Match NeedRetraction against the replace table and mark the accumulating mode; add the necessary retract-generating function at the replace table, and add the retract-processing logic at the retract consumer. 3. Turn on retraction for the over window aggregate. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6089) Implement decoration phase for rewriting predicated logical plan after volcano optimization phase
Shaoxuan Wang created FLINK-6089: Summary: Implement decoration phase for rewriting predicated logical plan after volcano optimization phase Key: FLINK-6089 URL: https://issues.apache.org/jira/browse/FLINK-6089 Project: Flink Issue Type: Sub-task Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang At present, there is no chance to modify the DataStreamRel tree after the volcano optimization. We propose to add a decoration phase after the volcano optimization phase. The decoration phase is dedicated to rewriting the predicated logical plan and is independent of the cost module. Once the decoration phase is added, we get the chance to apply retraction rules at this phase. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-6047) Master Jira for "Retraction for Flink Streaming"
Shaoxuan Wang created FLINK-6047: Summary: Master Jira for "Retraction for Flink Streaming" Key: FLINK-6047 URL: https://issues.apache.org/jira/browse/FLINK-6047 Project: Flink Issue Type: New Feature Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang [Design doc]: https://docs.google.com/document/d/18XlGPcfsGbnPSApRipJDLPg5IFNGTQjnz7emkVpZlkw [Introduction]: "Retraction" is an important building block for data streaming that refines early fired results. “Early firing” is very common and widely used in many streaming scenarios, for instance “window-less” or unbounded aggregates and stream-stream inner joins, and windowed (with early firing) aggregates and stream-stream inner joins. There are mainly two cases that require retractions: 1) updates on a keyed table (the key is either a primaryKey (PK) on the source table, or a groupKey/partitionKey in an aggregate); 2) when dynamic windows (e.g., session windows) are in use, the new value may replace more than one previous window due to window merging. To the best of our knowledge, the retraction of early fired streaming results has never been practically solved before. In this proposal, we develop a retraction solution and explain how it works for the problem of “updates on a keyed table”. The same solution can easily be extended to dynamic window merging, as the key component of retraction - how to refine an early fired result - is the same across different problems. [Proposed Jiras]: Implement decoration phase for predicated logical plan rewriting after volcano optimization phase Add source with table primary key and replace table property Add sink tableInsert and NeedRetract property Implement the retraction for partitioned unbounded over window aggregate Implement the retraction for stream-stream inner join Implement the retraction for the early firing window Implement the retraction for the dynamic window with early firing -- This message was sent by Atlassian JIRA (v6.3.15#6346)
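The "update on the keyed table" case above can be sketched concretely: when an early-fired result for a key is refined, a retract message for the old value is sent downstream before the new value. This is an illustrative model only, not the Flink implementation; the `+`/`-` message encoding is invented for the example:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the retraction idea for "update on the keyed table": a refined
// result first retracts the previously emitted value for its key.
public class RetractionSketch {
    private final Map<String, Long> lastEmitted = new HashMap<>();
    public final List<String> downstream = new ArrayList<>();

    public void update(String key, long newValue) {
        Long old = lastEmitted.put(key, newValue);
        if (old != null) {
            downstream.add("-(" + key + "," + old + ")"); // retract the old early-fired result
        }
        downstream.add("+(" + key + "," + newValue + ")"); // emit the refined result
    }

    public static void main(String[] args) {
        RetractionSketch r = new RetractionSketch();
        r.update("a", 1);
        r.update("a", 3); // refines the earlier result for key "a"
        System.out.println(r.downstream); // [+(a,1), -(a,1), +(a,3)]
    }
}
```

A downstream consumer that subtracts on `-` and adds on `+` ends up with the correct final value, which is what allows chained aggregates and joins over early-fired streams.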
[jira] [Updated] (FLINK-5984) Add resetAccumulator method for AggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-5984: - Summary: Add resetAccumulator method for AggregateFunction (was: Allow reusing of accumulators in AggregateFunction) > Add resetAccumulator method for AggregateFunction > - > > Key: FLINK-5984 > URL: https://issues.apache.org/jira/browse/FLINK-5984 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Timo Walther >Assignee: Shaoxuan Wang > > Right now we have to create a new accumulator object if we just want to reset > it. We should allow passing the old one as a {{reuse}} object to > {{AggregateFunction#createAccumulator}}. The aggregate function then can > decide if it wants to create a new object or reset the old one. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
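The renamed `resetAccumulator` idea can be shown in a short sketch: instead of allocating a fresh accumulator each time, the existing object is reset in place. The class and field names here are illustrative (loosely modeled on the sum accumulator quoted in the comments below), not the actual Flink API:

```java
// Sketch of resetAccumulator: reuse the old accumulator object rather than
// creating a new one. Names are illustrative, not the real Flink interface.
public class ResetSketch {
    public static class SumAccumulator {
        public long sum = 0L;        // running sum
        public boolean nonEmpty = false; // whether any value was accumulated
    }

    public static void accumulate(SumAccumulator acc, long v) {
        acc.sum += v;
        acc.nonEmpty = true;
    }

    // Reset in place: no new allocation, useful in tight DataSet loops.
    public static void resetAccumulator(SumAccumulator acc) {
        acc.sum = 0L;
        acc.nonEmpty = false;
    }

    public static void main(String[] args) {
        SumAccumulator acc = new SumAccumulator();
        accumulate(acc, 7);
        resetAccumulator(acc);
        System.out.println(acc.sum + " " + acc.nonEmpty); // 0 false
    }
}
```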
[jira] [Assigned] (FLINK-5984) Allow reusing of accumulators in AggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-5984: Assignee: Shaoxuan Wang > Allow reusing of accumulators in AggregateFunction > -- > > Key: FLINK-5984 > URL: https://issues.apache.org/jira/browse/FLINK-5984 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Timo Walther >Assignee: Shaoxuan Wang > > Right now we have to create a new accumulator object if we just want to reset > it. We should allow passing the old one as a {{reuse}} object to > {{AggregateFunction#createAccumulator}}. The aggregate function then can > decide if it wants to create a new object or reset the old one. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-5984) Allow reusing of accumulators in AggregateFunction
[ https://issues.apache.org/jira/browse/FLINK-5984?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15900597#comment-15900597 ] Shaoxuan Wang commented on FLINK-5984: -- [~twalthr] Thanks for the suggestion. This seems like a new "resetAccumulator" method to me. It is valuable for dataset, while dataStream does not need it for now.
{code}
override def resetAccumulator(acc: Accumulator) = {
  val a = acc.asInstanceOf[SumAccumulator[T]]
  a.f0 = numeric.zero // sum
  a.f1 = false
}
{code}
Fabian's proposal also looks good to me, as long as we make the "reuse" parameter completely clear to the users with detailed annotations. [~twalthr], if you have not started on this jira, I can help to make the changes. > Allow reusing of accumulators in AggregateFunction > -- > > Key: FLINK-5984 > URL: https://issues.apache.org/jira/browse/FLINK-5984 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Timo Walther > > Right now we have to create a new accumulator object if we just want to reset > it. We should allow passing the old one as a {{reuse}} object to > {{AggregateFunction#createAccumulator}}. The aggregate function then can > decide if it wants to create a new object or reset the old one. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Issue Comment Deleted] (FLINK-5983) Replace for/foreach/map in aggregates by while loops
[ https://issues.apache.org/jira/browse/FLINK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-5983: - Comment: was deleted (was: ok, the plan sounds good to me. The runtime processing function for aggregate will be codeGened, and it is user's call to use java or scala user defined aggregate functions and if it is written in scala the user is responsible for the performance.) > Replace for/foreach/map in aggregates by while loops > > > Key: FLINK-5983 > URL: https://issues.apache.org/jira/browse/FLINK-5983 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Timo Walther > > Right now there is a mixture of different kinds of loops within aggregate > functions. Although performance is not the main goal at the moment, we should > focus on performant execution, especially in these runtime functions, > e.g. {{DataSetTumbleCountWindowAggReduceGroupFunction}}. > We should replace loops, maps, etc. with primitive while loops. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-5983) Replace for/foreach/map in aggregates by while loops
[ https://issues.apache.org/jira/browse/FLINK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899619#comment-15899619 ] Shaoxuan Wang commented on FLINK-5983: -- ok, the plan sounds good to me. The runtime processing function for the aggregate will be codeGened. It is the user's call whether to use Java or Scala user-defined aggregate functions; if they are written in Scala, the user is responsible for the performance. > Replace for/foreach/map in aggregates by while loops > > > Key: FLINK-5983 > URL: https://issues.apache.org/jira/browse/FLINK-5983 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Timo Walther > > Right now there is a mixture of different kinds of loops within aggregate > functions. Although performance is not the main goal at the moment, we should > focus on performant execution, especially in these runtime functions, > e.g. {{DataSetTumbleCountWindowAggReduceGroupFunction}}. > We should replace loops, maps, etc. with primitive while loops. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-5983) Replace for/foreach/map in aggregates by while loops
[ https://issues.apache.org/jira/browse/FLINK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899594#comment-15899594 ] Shaoxuan Wang commented on FLINK-5983: -- Can we consider implementing the runtime code, and even the built-in aggregate functions, in Java? > Replace for/foreach/map in aggregates by while loops > > > Key: FLINK-5983 > URL: https://issues.apache.org/jira/browse/FLINK-5983 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Timo Walther > > Right now there is a mixture of different kinds of loops within aggregate > functions. Although performance is not the main goal at the moment, we should > focus on performant execution, especially in these runtime functions, > e.g. {{DataSetTumbleCountWindowAggReduceGroupFunction}}. > We should replace loops, maps, etc. with primitive while loops. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-5983) Replace for/foreach/map in aggregates by while loops
[ https://issues.apache.org/jira/browse/FLINK-5983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15899577#comment-15899577 ] Shaoxuan Wang commented on FLINK-5983: -- I love this proposal to replace Scala for loops with while loops in the runtime functions. Thanks [~twalthr]. > Replace for/foreach/map in aggregates by while loops > > > Key: FLINK-5983 > URL: https://issues.apache.org/jira/browse/FLINK-5983 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Timo Walther > > Right now there is a mixture of different kinds of loops within aggregate > functions. Although performance is not the main goal at the moment, we should > focus on performant execution, especially in these runtime functions, > e.g. {{DataSetTumbleCountWindowAggReduceGroupFunction}}. > We should replace loops, maps, etc. with primitive while loops. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
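The optimization discussed in this ticket is about replacing higher-order iteration, which allocates closures and iterators per call, with primitive while loops in hot runtime paths. A minimal Java illustration of the two styles (the ticket targets Scala for/foreach/map, but the trade-off is the same):

```java
import java.util.stream.LongStream;

// Contrast of functional-style iteration vs a primitive while loop, as
// discussed for the aggregate runtime functions. Illustrative only.
public class WhileLoopSketch {
    // Functional style: concise, but allocates stream machinery per call.
    public static long sumStream(long[] values) {
        return LongStream.of(values).sum();
    }

    // Primitive while loop: no allocations, no per-element virtual calls.
    public static long sumWhile(long[] values) {
        long sum = 0L;
        int i = 0;
        while (i < values.length) {
            sum += values[i];
            i++;
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] v = {1, 2, 3, 4};
        System.out.println(sumStream(v) + " " + sumWhile(v)); // 10 10
    }
}
```

Both compute the same result; the while-loop form matters only in per-record runtime code, which is why the ticket scopes it to functions like {{DataSetTumbleCountWindowAggReduceGroupFunction}}.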
[jira] [Assigned] (FLINK-5655) Add event time OVER RANGE BETWEEN x PRECEDING aggregation to SQL
[ https://issues.apache.org/jira/browse/FLINK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-5655: Assignee: sunjincheng (was: Shaoxuan Wang) > Add event time OVER RANGE BETWEEN x PRECEDING aggregation to SQL > > > Key: FLINK-5655 > URL: https://issues.apache.org/jira/browse/FLINK-5655 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Fabian Hueske >Assignee: sunjincheng > > The goal of this issue is to add support for OVER RANGE aggregations on event > time streams to the SQL interface. > Queries similar to the following should be supported: > {code} > SELECT > a, > SUM(b) OVER (PARTITION BY c ORDER BY rowTime() RANGE BETWEEN INTERVAL '1' > HOUR PRECEDING AND CURRENT ROW) AS sumB, > MIN(b) OVER (PARTITION BY c ORDER BY rowTime() RANGE BETWEEN INTERVAL '1' > HOUR PRECEDING AND CURRENT ROW) AS minB > FROM myStream > {code} > The following restrictions should initially apply: > - All OVER clauses in the same SELECT clause must be exactly the same. > - The PARTITION BY clause is optional (no partitioning results in single > threaded execution). > - The ORDER BY clause may only have rowTime() as parameter. rowTime() is a > parameterless scalar function that just indicates processing time mode. > - UNBOUNDED PRECEDING is not supported (see FLINK-5658) > - FOLLOWING is not supported. > The restrictions will be resolved in follow up issues. If we find that some > of the restrictions are trivial to address, we can add the functionality in > this issue as well. > This issue includes: > - Design of the DataStream operator to compute OVER ROW aggregates > - Translation from Calcite's RelNode representation (LogicalProject with > RexOver expression). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-5655) Add event time OVER RANGE BETWEEN x PRECEDING aggregation to SQL
[ https://issues.apache.org/jira/browse/FLINK-5655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15898814#comment-15898814 ] Shaoxuan Wang commented on FLINK-5655: -- [~sunjincheng121] and I want to start the implementation on this event-time bounded over window. We will start with a design doc. Reassigning the task to sunjincheng121. > Add event time OVER RANGE BETWEEN x PRECEDING aggregation to SQL > > > Key: FLINK-5655 > URL: https://issues.apache.org/jira/browse/FLINK-5655 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Fabian Hueske >Assignee: sunjincheng > > The goal of this issue is to add support for OVER RANGE aggregations on event > time streams to the SQL interface. > Queries similar to the following should be supported: > {code} > SELECT > a, > SUM(b) OVER (PARTITION BY c ORDER BY rowTime() RANGE BETWEEN INTERVAL '1' > HOUR PRECEDING AND CURRENT ROW) AS sumB, > MIN(b) OVER (PARTITION BY c ORDER BY rowTime() RANGE BETWEEN INTERVAL '1' > HOUR PRECEDING AND CURRENT ROW) AS minB > FROM myStream > {code} > The following restrictions should initially apply: > - All OVER clauses in the same SELECT clause must be exactly the same. > - The PARTITION BY clause is optional (no partitioning results in single > threaded execution). > - The ORDER BY clause may only have rowTime() as parameter. rowTime() is a > parameterless scalar function that just indicates processing time mode. > - UNBOUNDED PRECEDING is not supported (see FLINK-5658) > - FOLLOWING is not supported. > The restrictions will be resolved in follow up issues. If we find that some > of the restrictions are trivial to address, we can add the functionality in > this issue as well. > This issue includes: > - Design of the DataStream operator to compute OVER ROW aggregates > - Translation from Calcite's RelNode representation (LogicalProject with > RexOver expression). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5956) Add retract method into the aggregateFunction
Shaoxuan Wang created FLINK-5956: Summary: Add retract method into the aggregateFunction Key: FLINK-5956 URL: https://issues.apache.org/jira/browse/FLINK-5956 Project: Flink Issue Type: Sub-task Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang The retraction method helps with processing updated messages. It will also be very helpful for window aggregation. This PR will first add retraction methods to the aggregateFunctions, so that the ongoing over-window aggregation work can benefit from it. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
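The retract method pairs with accumulate: it undoes a previously accumulated input, which lets a bounded over window evict an expired row without recomputing the whole window. A sketch for a sum aggregate (names are illustrative, not the Flink API):

```java
// Sketch of a retractable sum aggregate: retract(v) undoes accumulate(v),
// as proposed for bounded over-window aggregation. Illustrative names.
public class RetractableSum {
    public static class Acc {
        public long sum = 0L;
        public long count = 0L; // rows currently contributing to sum
    }

    public static void accumulate(Acc acc, long v) {
        acc.sum += v;
        acc.count++;
    }

    // Undo a previously accumulated value, e.g. when a row leaves the window.
    public static void retract(Acc acc, long v) {
        acc.sum -= v;
        acc.count--;
    }

    public static void main(String[] args) {
        Acc acc = new Acc();
        accumulate(acc, 4);
        accumulate(acc, 6);
        retract(acc, 4); // row with value 4 expires from the window
        System.out.println(acc.sum); // 6
    }
}
```

Retract only works for aggregates whose state is invertible this way; a MIN/MAX aggregate, for instance, needs extra bookkeeping beyond a single running value.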
[jira] [Created] (FLINK-5955) Merging a list of buffered records will have problems when ObjectReuse is turned on
Shaoxuan Wang created FLINK-5955: Summary: Merging a list of buffered records will have problems when ObjectReuse is turned on Key: FLINK-5955 URL: https://issues.apache.org/jira/browse/FLINK-5955 Project: Flink Issue Type: Bug Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang Turn on ObjectReuse in MultipleProgramsTestBase: TestEnvironment clusterEnv = new TestEnvironment(cluster, 4, true); Then the tests "testEventTimeSessionGroupWindow" and "testEventTimeTumblingGroupWindowOverTime" will fail. The reason is that we buffer iterated records for group-merge. I think we should change the Agg merge to pair-merge, and later add group-merge when needed (in the future we should add rules to select either pair-merge or group-merge, but for now all built-in aggregates should work fine with pair-merge). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
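The pair-merge fix can be illustrated with a small sketch: with object reuse enabled, the runtime hands the same record object to the iterator repeatedly, so buffering a whole Iterable (group-merge) would see stale, overwritten objects; merging two accumulators at a time consumes each value immediately and never needs to buffer. This is an illustrative model, not the actual Flink merge interface:

```java
// Sketch of pair-merge: fold accumulators two at a time so no buffering of
// (possibly reused) record objects is ever required. Illustrative names.
public class PairMergeSketch {
    public static class Acc {
        public long sum;
        public Acc(long sum) { this.sum = sum; }
    }

    // Merge exactly two accumulators; the result lives in the first one.
    public static Acc mergePair(Acc a, Acc b) {
        a.sum += b.sum;
        return a;
    }

    public static void main(String[] args) {
        Acc result = new Acc(0);
        long[] partials = {3, 5, 9};
        for (long p : partials) {
            // Each partial is consumed immediately after it is produced,
            // which is why object reuse cannot corrupt the result here.
            result = mergePair(result, new Acc(p));
        }
        System.out.println(result.sum); // 17
    }
}
```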
[jira] [Created] (FLINK-5927) Remove old Aggregate interface and built-in functions
Shaoxuan Wang created FLINK-5927: Summary: Remove old Aggregate interface and built-in functions Key: FLINK-5927 URL: https://issues.apache.org/jira/browse/FLINK-5927 Project: Flink Issue Type: Sub-task Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (FLINK-5768) Apply new aggregation functions for datastream and dataset tables
[ https://issues.apache.org/jira/browse/FLINK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-5768: - Description: Apply new aggregation functions for datastream and dataset tables This includes: 1. Change the implementation of the DataStream aggregation runtime code to use the new aggregation functions and the aggregate dataStream API. 2. DataStream will always run in incremental mode, as explained on 06/Feb/2017 in FLINK-5564. 3. Change the implementation of the DataSet aggregation runtime code to use the new aggregation functions. 4. Clean up unused classes and methods. was:Change the implementation of the DataStream aggregation runtime code to use new aggregation functions and aggregate dataStream API. > Apply new aggregation functions for datastream and dataset tables > - > > Key: FLINK-5768 > URL: https://issues.apache.org/jira/browse/FLINK-5768 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > Apply new aggregation functions for datastream and dataset tables > This includes: > 1. Change the implementation of the DataStream aggregation runtime code to > use the new aggregation functions and the aggregate dataStream API. > 2. DataStream will always run in incremental mode, as explained on > 06/Feb/2017 in FLINK-5564. > 3. Change the implementation of the DataSet aggregation runtime code to use > the new aggregation functions. > 4. Clean up unused classes and methods. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (FLINK-5768) Apply new aggregation functions for datastream and dataset tables
[ https://issues.apache.org/jira/browse/FLINK-5768?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-5768: - Summary: Apply new aggregation functions for datastream and dataset tables (was: Apply new aggregation functions and aggregate DataStream API for streaming tables) > Apply new aggregation functions for datastream and dataset tables > - > > Key: FLINK-5768 > URL: https://issues.apache.org/jira/browse/FLINK-5768 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > Change the implementation of the DataStream aggregation runtime code to use > new aggregation functions and aggregate dataStream API. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-5564) User Defined Aggregates
[ https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15884070#comment-15884070 ] Shaoxuan Wang commented on FLINK-5564: -- It is hard to separate the PRs for dataStream (FLINK-5768) and dataSet (FLINK-5769). The method we use to translate an aggregate into an AggregateFunction (transformToAggregateFunctions) is the same regardless of dataSet or dataStream. I would therefore like to merge these two tasks and provide a single PR. > User Defined Aggregates > --- > > Key: FLINK-5564 > URL: https://issues.apache.org/jira/browse/FLINK-5564 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > User-defined aggregates would be a great addition to the Table API / SQL. > The current aggregate interface is not well suited for the external users. > This issue proposes to redesign the aggregate such that we can expose an > better external UDAGG interface to the users. The detailed design proposal > can be found here: > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit > Motivation: > 1. The current aggregate interface is not very concise to the users. One > needs to know the design details of the intermediate Row buffer before > implements an Aggregate. Seven functions are needed even for a simple Count > aggregate. > 2. Another limitation of current aggregate function is that it can only be > applied on one single column. There are many scenarios which require the > aggregate function taking multiple columns as the inputs. > 3. “Retraction” is not considered and covered in the current Aggregate. > 4. It might be very good to have a local/global aggregate query plan > optimization, which is very promising to optimize UDAGG performance in some > scenarios. > Proposed Changes: > 1. Implement an aggregate dataStream API (Done by > [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582]) > 2. 
Update all the existing aggregates to use the new aggregate dataStream API > 3. Provide a better User-Defined Aggregate interface > 4. Add retraction support > 5. Add local/global aggregate -- This message was sent by Atlassian JIRA (v6.3.15#6346)
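The motivation above notes that the old Aggregate interface requires seven functions even for a simple Count. A concise interface of the kind the proposal aims for can be sketched in plain Java; the interface shape and method names here are assumptions for illustration, not the final FlinkML/Table API:

```java
// Hypothetical sketch of a concise UDAGG interface: a simple Count needs
// only an accumulator plus a few methods, instead of the seven functions
// the old Aggregate interface demands. Names are illustrative only.
public class CountAggSketch {
    interface AggregateFunction<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC accumulate(ACC acc, IN input);
        OUT getResult(ACC acc);
    }

    // Count: the accumulator is just a Long, no Row-buffer details exposed.
    static final AggregateFunction<Object, Long, Long> COUNT =
        new AggregateFunction<Object, Long, Long>() {
            public Long createAccumulator() { return 0L; }
            public Long accumulate(Long acc, Object input) { return acc + 1; }
            public Long getResult(Long acc) { return acc; }
        };

    public static void main(String[] args) {
        Long acc = COUNT.createAccumulator();
        for (Object rec : new Object[] {"a", "b", "c"}) {
            acc = COUNT.accumulate(acc, rec);
        }
        System.out.println(COUNT.getResult(acc)); // 3
    }
}
```

The point of the redesign is that implementers only reason about their own accumulator type, not the intermediate Row layout.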
[jira] [Created] (FLINK-5915) Add support for the aggregate on multi fields
Shaoxuan Wang created FLINK-5915: Summary: Add support for the aggregate on multi fields Key: FLINK-5915 URL: https://issues.apache.org/jira/browse/FLINK-5915 Project: Flink Issue Type: Sub-task Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang Some UDAGGs take multiple fields as input. For instance:
table
  .window(Tumble over 10.minutes on 'rowtime as 'w)
  .groupBy('key, 'w)
  .select('key, weightedAvg('value, 'weight))
This task will add support for aggregates on multiple fields. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
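The weightedAvg('value, 'weight) call above is exactly the multi-field case this issue targets. A plain-Java sketch of such an aggregate (names are illustrative, not Flink's actual UDAGG API) shows why two input fields are needed per record:

```java
// Hypothetical sketch of a multi-field aggregate: a weighted average that
// consumes two inputs (value, weight) per record. Class and method names
// are illustrative, not the actual Flink UDAGG interface.
public class WeightedAvgSketch {
    // Accumulator holds the running weighted sum and the total weight.
    static class Accumulator {
        double weightedSum = 0.0;
        double totalWeight = 0.0;
    }

    static Accumulator createAccumulator() {
        return new Accumulator();
    }

    // accumulate takes multiple fields from one row: the value and its weight.
    static void accumulate(Accumulator acc, double value, double weight) {
        acc.weightedSum += value * weight;
        acc.totalWeight += weight;
    }

    static double getResult(Accumulator acc) {
        return acc.totalWeight == 0.0 ? 0.0 : acc.weightedSum / acc.totalWeight;
    }

    public static void main(String[] args) {
        Accumulator acc = createAccumulator();
        accumulate(acc, 10.0, 1.0);
        accumulate(acc, 20.0, 3.0);
        System.out.println(getResult(acc)); // 17.5
    }
}
```

A single-column interface cannot express this, since value and weight must arrive together in one accumulate call.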
[jira] [Created] (FLINK-5914) remove aggregateResultType from streaming.api.datastream.aggregate
Shaoxuan Wang created FLINK-5914: Summary: remove aggregateResultType from streaming.api.datastream.aggregate Key: FLINK-5914 URL: https://issues.apache.org/jira/browse/FLINK-5914 Project: Flink Issue Type: Improvement Components: Streaming Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang aggregateResultType does not seem necessary for streaming.api.datastream.aggregate. We will in any case not serialize the aggregate result between the aggregate and the window function. The aggregate function itself provides getResult(), and the window function should simply emit the same result as the aggregate output, so aggregateResultType should always be the same as resultType. I think we can safely remove aggregateResultType, so that users will not have to provide the same type twice for streaming.api.datastream.aggregate. [~StephanEwen], what do you think? -- This message was sent by Atlassian JIRA (v6.3.15#6346)
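The redundancy argued above can be shown with a dependency-free sketch: the window step just forwards whatever getResult() produces, so the emitted type is by construction the aggregate's result type. This plain-Java model stands in for the Flink operators; the names are illustrative:

```java
import java.util.List;

// Minimal sketch (plain Java, no Flink dependency) of why a separate
// aggregateResultType is redundant: the window function simply forwards
// the aggregate's getResult() output, so the emitted type is by
// construction the aggregate's result type. Names are illustrative.
public class WindowAggSketch {
    // A count aggregate: Long accumulator, Long result.
    static Long createAccumulator() { return 0L; }
    static Long add(Long acc, Object record) { return acc + 1; }
    static Long getResult(Long acc) { return acc; }

    // The "window function" step: it emits the aggregate result unchanged,
    // so its output type necessarily equals the aggregate's result type.
    static Long applyWindow(Long aggregateResult) { return aggregateResult; }

    static Long countWindow(List<Object> windowedRecords) {
        Long acc = createAccumulator();
        for (Object r : windowedRecords) {
            acc = add(acc, r);
        }
        return applyWindow(getResult(acc));
    }

    public static void main(String[] args) {
        System.out.println(countWindow(List.of("a", "b", "c"))); // 3
    }
}
```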
[jira] [Assigned] (FLINK-5906) Add support to register UDAGGs in TableEnvironment
[ https://issues.apache.org/jira/browse/FLINK-5906?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang reassigned FLINK-5906: Assignee: Shaoxuan Wang > Add support to register UDAGGs in TableEnvironment > -- > > Key: FLINK-5906 > URL: https://issues.apache.org/jira/browse/FLINK-5906 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Fabian Hueske >Assignee: Shaoxuan Wang > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (FLINK-5899) Fix the bug in EventTimeTumblingWindow for non-partialMerge aggregate
[ https://issues.apache.org/jira/browse/FLINK-5899?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-5899: - Summary: Fix the bug in EventTimeTumblingWindow for non-partialMerge aggregate (was: Fix the bug in initializing the DataSetTumbleTimeWindowAggReduceGroupFunction) > Fix the bug in EventTimeTumblingWindow for non-partialMerge aggregate > - > > Key: FLINK-5899 > URL: https://issues.apache.org/jira/browse/FLINK-5899 > Project: Flink > Issue Type: Bug > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > The row length used to initialize > DataSetTumbleTimeWindowAggReduceGroupFunction was not set properly. (I think > this was introduced by mistake when merging the code.) > We currently lack built-in non-partial-merge Aggregates, so this > has not been caught by the unit tests. > Reproduction steps: > 1. Set "supportPartial" to false for SumAggregate. > 2. Both testAllEventTimeTumblingWindowOverTime and > testEventTimeTumblingGroupWindowOverTime will then fail. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5900) Add non-partial merge Aggregates and unit tests
Shaoxuan Wang created FLINK-5900: Summary: Add non-partial merge Aggregates and unit tests Key: FLINK-5900 URL: https://issues.apache.org/jira/browse/FLINK-5900 Project: Flink Issue Type: Improvement Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang The current built-in aggregates all support partial merge, so we cannot tell whether the non-partial aggregate path works. We should add non-partial-merge Aggregates and unit tests. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5899) Fix the bug in initializing the DataSetTumbleTimeWindowAggReduceGroupFunction
Shaoxuan Wang created FLINK-5899: Summary: Fix the bug in initializing the DataSetTumbleTimeWindowAggReduceGroupFunction Key: FLINK-5899 URL: https://issues.apache.org/jira/browse/FLINK-5899 Project: Flink Issue Type: Bug Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang The row length used to initialize DataSetTumbleTimeWindowAggReduceGroupFunction was not set properly. (I think this was introduced by mistake when merging the code.) We currently lack built-in non-partial-merge Aggregates, so this has not been caught by the unit tests. Reproduction steps: 1. Set "supportPartial" to false for SumAggregate. 2. Both testAllEventTimeTumblingWindowOverTime and testEventTimeTumblingGroupWindowOverTime will then fail. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5878) Add stream-stream inner join on TableAPI
Shaoxuan Wang created FLINK-5878: Summary: Add stream-stream inner join on TableAPI Key: FLINK-5878 URL: https://issues.apache.org/jira/browse/FLINK-5878 Project: Flink Issue Type: New Feature Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang This task is intended to support stream-stream inner joins on the Table API. A brief design doc is available here: https://docs.google.com/document/d/10oJDw-P9fImD5tc3Stwr7aGIhvem4l8uB2bjLzK72u4/edit We propose to use MapState as the backing state interface for this "join" operator, so this task depends on FLINK-4856. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
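The MapState-backed operator described above can be roughly sketched without Flink: each input side buffers its rows in per-key state, and a newly arriving row probes the opposite side's buffer. This is a simplification for illustration (plain HashMaps stand in for MapState, and state cleanup/retention is omitted); it is not the actual operator implementation:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Rough sketch of the state layout for a stream-stream inner join: each
// side buffers its rows in per-key state (plain maps stand in for Flink's
// MapState), and a new row on one side probes the other side's buffer.
public class StreamJoinSketch {
    final Map<String, List<String>> leftState = new HashMap<>();
    final Map<String, List<String>> rightState = new HashMap<>();
    final List<String> output = new ArrayList<>();

    void onLeft(String key, String row) {
        // Buffer the left row, then probe buffered right rows for this key.
        leftState.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        for (String other : rightState.getOrDefault(key, List.of())) {
            output.add(row + "|" + other);
        }
    }

    void onRight(String key, String row) {
        // Symmetric: buffer the right row, probe buffered left rows.
        rightState.computeIfAbsent(key, k -> new ArrayList<>()).add(row);
        for (String other : leftState.getOrDefault(key, List.of())) {
            output.add(other + "|" + row);
        }
    }

    public static void main(String[] args) {
        StreamJoinSketch join = new StreamJoinSketch();
        join.onLeft("k", "l1");
        join.onRight("k", "r1");
        join.onLeft("k", "l2");
        System.out.println(join.output); // joined pairs for key "k"
    }
}
```

The per-key map lookup is why MapState (FLINK-4856) matters: the operator must access one key's buffer without deserializing the whole joined state.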
[jira] [Updated] (FLINK-5767) New aggregate function interface and built-in aggregate functions
[ https://issues.apache.org/jira/browse/FLINK-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shaoxuan Wang updated FLINK-5767: - Summary: New aggregate function interface and built-in aggregate functions (was: Add a new aggregate function interface) > New aggregate function interface and built-in aggregate functions > - > > Key: FLINK-5767 > URL: https://issues.apache.org/jira/browse/FLINK-5767 > Project: Flink > Issue Type: Sub-task > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > Add a new aggregate function interface. This includes implementing the > aggregate interface, migrating the existing aggregation functions to this > interface, and adding the unit tests for these functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5813) code generation for user-defined aggregate functions
Shaoxuan Wang created FLINK-5813: Summary: code generation for user-defined aggregate functions Key: FLINK-5813 URL: https://issues.apache.org/jira/browse/FLINK-5813 Project: Flink Issue Type: Sub-task Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang The input and return types of the newly proposed UDAGG functions are supplied dynamically by users, so all of these user-defined functions have to be generated via codegen. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5768) Apply new aggregation functions and aggregate DataStream API for streaming tables
Shaoxuan Wang created FLINK-5768: Summary: Apply new aggregation functions and aggregate DataStream API for streaming tables Key: FLINK-5768 URL: https://issues.apache.org/jira/browse/FLINK-5768 Project: Flink Issue Type: Sub-task Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang Change the implementation of the DataStream aggregation runtime code to use new aggregation functions and aggregate dataStream API. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5769) Apply new aggregation functions for dataset tables
Shaoxuan Wang created FLINK-5769: Summary: Apply new aggregation functions for dataset tables Key: FLINK-5769 URL: https://issues.apache.org/jira/browse/FLINK-5769 Project: Flink Issue Type: Sub-task Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang Change the implementation of the Dataset aggregation runtime code to use new aggregation functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (FLINK-5767) Add a new aggregate function interface
Shaoxuan Wang created FLINK-5767: Summary: Add a new aggregate function interface Key: FLINK-5767 URL: https://issues.apache.org/jira/browse/FLINK-5767 Project: Flink Issue Type: Sub-task Components: Table API & SQL Reporter: Shaoxuan Wang Assignee: Shaoxuan Wang Add a new aggregate function interface. This includes implementing the aggregate interface, migrating the existing aggregation functions to this interface, and adding the unit tests for these functions. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (FLINK-5564) User Defined Aggregates
[ https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859039#comment-15859039 ] Shaoxuan Wang edited comment on FLINK-5564 at 2/9/17 5:16 AM: -- Thanks [~fhueske], Absolutely, I agree with you that it is better to separate the huge PR into some ones. (merge 1,2,3 will lead to more than 3K lines change) But I am afraid I did not completely get your suggested #1. Migrating the existing Agg without changing runtime code will lead to all Integration Test fail. One possible way is that I create new interface (say AggregateFunction) and create a few Aggs which is implemented from new interface (say intAgg extends AggregateFunction), and in step #1, I just add queryPlan tests, like what we usually did in GroupWindowTest. Is this what you are suggesting. was (Author: shaoxuanwang): Thanks [~fhueske], Absolutely, I agree with you that it is better to separate the huge PR into some ones. (merge 1,2,3 will lead to more than 3K lines change) But I am afraid I did not completely get your suggested #1. Migrating the existing Agg without changing runtime code will lead to all IntergrationTest fail. One possible way is that I create new interface (say AggregateFunction) and create a few Aggs which is implemented from new interface (say intAgg extends AggregateFunction), and in step #1, I just add queryPlan tests, like what we usually did in GroupWindowTest. Is this what you are suggesting. > User Defined Aggregates > --- > > Key: FLINK-5564 > URL: https://issues.apache.org/jira/browse/FLINK-5564 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > User-defined aggregates would be a great addition to the Table API / SQL. > The current aggregate interface is not well suited for the external users. > This issue proposes to redesign the aggregate such that we can expose an > better external UDAGG interface to the users. 
The detailed design proposal > can be found here: > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit > Motivation: > 1. The current aggregate interface is not very concise to the users. One > needs to know the design details of the intermediate Row buffer before > implements an Aggregate. Seven functions are needed even for a simple Count > aggregate. > 2. Another limitation of current aggregate function is that it can only be > applied on one single column. There are many scenarios which require the > aggregate function taking multiple columns as the inputs. > 3. “Retraction” is not considered and covered in the current Aggregate. > 4. It might be very good to have a local/global aggregate query plan > optimization, which is very promising to optimize UDAGG performance in some > scenarios. > Proposed Changes: > 1. Implement an aggregate dataStream API (Done by > [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582]) > 2. Update all the existing aggregates to use the new aggregate dataStream API > 3. Provide a better User-Defined Aggregate interface > 4. Add retraction support > 5. Add local/global aggregate -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (FLINK-5564) User Defined Aggregates
[ https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859039#comment-15859039 ] Shaoxuan Wang edited comment on FLINK-5564 at 2/9/17 5:16 AM: -- Thanks [~fhueske], Absolutely, I agree with you that it is better to separate the huge PR into some ones. (merge 1,2,3 will lead to more than 3K lines change) But I am afraid I did not completely get your suggested #1. Migrating the existing Agg without changing runtime code will lead to all IntergrationTest fail. One possible way is that I create new interface (say AggregateFunction) and create a few Aggs which is implemented from new interface (say intAgg extends AggregateFunction), and in step #1, I just add queryPlan tests, like what we usually did in GroupWindowTest. Is this what you are suggesting. was (Author: shaoxuanwang): Thanks [~fhueske], Obsoletely, I agree with you that it is better to separate the huge PR into some ones. (merge 1,2,3 will lead to more than 3K lines change) But I am afraid I did not completely get your suggested #1. Migrating the existing Agg without changing runtime code will lead to all IntergrationTest fail. One possible way is that I create new interface (say AggregateFunction) and create a few Aggs which is implemented from new interface (say intAgg extends AggregateFunction), and in step #1, I just add queryPlan tests, like what we usually did in GroupWindowTest. Is this what you are suggesting. > User Defined Aggregates > --- > > Key: FLINK-5564 > URL: https://issues.apache.org/jira/browse/FLINK-5564 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > User-defined aggregates would be a great addition to the Table API / SQL. > The current aggregate interface is not well suited for the external users. > This issue proposes to redesign the aggregate such that we can expose an > better external UDAGG interface to the users. 
The detailed design proposal > can be found here: > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit > Motivation: > 1. The current aggregate interface is not very concise to the users. One > needs to know the design details of the intermediate Row buffer before > implements an Aggregate. Seven functions are needed even for a simple Count > aggregate. > 2. Another limitation of current aggregate function is that it can only be > applied on one single column. There are many scenarios which require the > aggregate function taking multiple columns as the inputs. > 3. “Retraction” is not considered and covered in the current Aggregate. > 4. It might be very good to have a local/global aggregate query plan > optimization, which is very promising to optimize UDAGG performance in some > scenarios. > Proposed Changes: > 1. Implement an aggregate dataStream API (Done by > [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582]) > 2. Update all the existing aggregates to use the new aggregate dataStream API > 3. Provide a better User-Defined Aggregate interface > 4. Add retraction support > 5. Add local/global aggregate -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (FLINK-5564) User Defined Aggregates
[ https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15859039#comment-15859039 ] Shaoxuan Wang commented on FLINK-5564: -- Thanks [~fhueske], Obsoletely, I agree with you that it is better to separate the huge PR into some ones. (merge 1,2,3 will lead to more than 3K lines change) But I am afraid I did not completely get your suggested #1. Migrating the existing Agg without changing runtime code will lead to all IntergrationTest fail. One possible way is that I create new interface (say AggregateFunction) and create a few Aggs which is implemented from new interface (say intAgg extends AggregateFunction), and in step #1, I just add queryPlan tests, like what we usually did in GroupWindowTest. Is this what you are suggesting. > User Defined Aggregates > --- > > Key: FLINK-5564 > URL: https://issues.apache.org/jira/browse/FLINK-5564 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > User-defined aggregates would be a great addition to the Table API / SQL. > The current aggregate interface is not well suited for the external users. > This issue proposes to redesign the aggregate such that we can expose an > better external UDAGG interface to the users. The detailed design proposal > can be found here: > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit > Motivation: > 1. The current aggregate interface is not very concise to the users. One > needs to know the design details of the intermediate Row buffer before > implements an Aggregate. Seven functions are needed even for a simple Count > aggregate. > 2. Another limitation of current aggregate function is that it can only be > applied on one single column. There are many scenarios which require the > aggregate function taking multiple columns as the inputs. > 3. 
“Retraction” is not considered and covered in the current Aggregate. > 4. It might be very good to have a local/global aggregate query plan > optimization, which is very promising to optimize UDAGG performance in some > scenarios. > Proposed Changes: > 1. Implement an aggregate dataStream API (Done by > [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582]) > 2. Update all the existing aggregates to use the new aggregate dataStream API > 3. Provide a better User-Defined Aggregate interface > 4. Add retraction support > 5. Add local/global aggregate -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Comment Edited] (FLINK-5564) User Defined Aggregates
[ https://issues.apache.org/jira/browse/FLINK-5564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15854168#comment-15854168 ] Shaoxuan Wang edited comment on FLINK-5564 at 2/6/17 3:02 PM: -- Hi all, as also discussed in the dev email thread for the FLIP11 over window design, when I work on refactoring the streaming query plan, I feel that we do not need to keep the non-incremental query plan for streaming aggregation, as all the streaming aggregation should be suitable for incremental aggregate (even for max, min and median). One can choose to accumulate all records at the same time when the window is completed. But it will still execute the accumulate method to update the accumulator state for each record. The way it executes accumulate function to accumulate each record already implies that the aggregation is incremental. Whether it is accumulated once at each record arrival (incremental) or accumulated all records when the window is completed (non-incremental), really does not matter in terms of the correctness and the complexity. On the other hand, the non-incremental approach will introduce CPU jitter and latency overhead, so I would like to propose to always apply incremental mode for all streaming aggregations. was (Author: shaoxuanwang): Hi all, as also discussed in the dev email thread for the FLIP11 over window design, when I work on refactoring the streaming query plan, we do not need to keep the non-incremental query plan for streaming aggregation, as all the streaming aggregation should be suitable for incremental aggregate (even for max, min and median). One can choose to accumulate all records at the same time when the window is completed. But it will still execute the accumulate method to update the accumulator state for each record. The way it executes accumulate function to accumulate each record already implies that the aggregation is incremental. 
Whether it is accumulated once at each record arrival (incremental) or accumulated all records when the window is completed (non-incremental), really does not matter in terms of the correctness and the complexity. On the other hand, the non-incremental approach will introduce CPU jitter and latency overhead, so I would like to propose to always apply incremental mode for all streaming aggregations. > User Defined Aggregates > --- > > Key: FLINK-5564 > URL: https://issues.apache.org/jira/browse/FLINK-5564 > Project: Flink > Issue Type: Improvement > Components: Table API & SQL >Reporter: Shaoxuan Wang >Assignee: Shaoxuan Wang > > User-defined aggregates would be a great addition to the Table API / SQL. > The current aggregate interface is not well suited for the external users. > This issue proposes to redesign the aggregate such that we can expose an > better external UDAGG interface to the users. The detailed design proposal > can be found here: > https://docs.google.com/document/d/19JXK8jLIi8IqV9yf7hOs_Oz67yXOypY7Uh5gIOK2r-U/edit > Motivation: > 1. The current aggregate interface is not very concise to the users. One > needs to know the design details of the intermediate Row buffer before > implements an Aggregate. Seven functions are needed even for a simple Count > aggregate. > 2. Another limitation of current aggregate function is that it can only be > applied on one single column. There are many scenarios which require the > aggregate function taking multiple columns as the inputs. > 3. “Retraction” is not considered and covered in the current Aggregate. > 4. It might be very good to have a local/global aggregate query plan > optimization, which is very promising to optimize UDAGG performance in some > scenarios. > Proposed Changes: > 1. Implement an aggregate dataStream API (Done by > [FLINK-5582|https://issues.apache.org/jira/browse/FLINK-5582]) > 2. Update all the existing aggregates to use the new aggregate dataStream API > 3. 
Provide a better User-Defined Aggregate interface > 4. Add retraction support > 5. Add local/global aggregate -- This message was sent by Atlassian JIRA (v6.3.15#6346)
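The incremental-mode argument in the comment above can be made concrete with a small sketch: whether accumulate() fires once per arriving record or over the buffered records when the window completes, the same accumulate calls run and the result is identical, even for max. This is a plain-Java illustration with assumed names, not Flink runtime code:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the incremental-vs-non-incremental argument: both modes
// execute the same accumulate() calls, only at different times, so the
// result is identical -- even for max. Names are illustrative.
public class IncrementalMaxSketch {
    static long accumulate(long acc, long value) {
        return Math.max(acc, value);
    }

    // Incremental mode: the accumulator state is updated on every arrival.
    static long incremental(List<Long> arrivals) {
        long acc = Long.MIN_VALUE;
        for (long v : arrivals) {
            acc = accumulate(acc, v);   // runs at record arrival time
        }
        return acc;
    }

    // Non-incremental mode: buffer all records, then run the very same
    // accumulate calls when the window completes.
    static long nonIncremental(List<Long> arrivals) {
        List<Long> buffer = new ArrayList<>(arrivals); // held until window fires
        long acc = Long.MIN_VALUE;
        for (long v : buffer) {
            acc = accumulate(acc, v);   // runs at window completion time
        }
        return acc;
    }

    public static void main(String[] args) {
        List<Long> records = List.of(3L, 9L, 1L);
        System.out.println(incremental(records) == nonIncremental(records)); // true
    }
}
```

The only behavioral difference is when the CPU work happens, which is exactly the jitter/latency overhead the comment attributes to the non-incremental plan.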