[ 
https://issues.apache.org/jira/browse/COMDEV-473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bertty Contreras updated COMDEV-473:
------------------------------------
    Description: 
*Synopsis*

Apache Wayang (Incubating) currently uses a cost model to select the right
set of platforms while optimizing query plans. Often, the initial cost model
becomes ineffective after some time, and the cost model must be calibrated
again. The goal is to create an ML pipeline that starts the calibration of the
cost model automatically and uses the logs of previous query executions to
refine the cost model so that it follows the workload that interacts with the
Apache Wayang (Incubating) environment.
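
As a rough illustration of calibrating from execution logs, the sketch below
fits a linear cost function to logged runtimes with ordinary least squares.
It is only a minimal sketch under stated assumptions: the ExecutionLog record,
its fields, and the linear form runtime = alpha * cardinality + beta are
hypothetical and are not taken from the Apache Wayang code base or from the
paper referenced in this proposal.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ExecutionLog:
    """Hypothetical log record: operator input size and measured runtime."""
    input_cardinality: float
    runtime_ms: float


def calibrate_linear_cost(logs: List[ExecutionLog]) -> Tuple[float, float]:
    """Fit runtime_ms ~= alpha * input_cardinality + beta by least squares."""
    n = len(logs)
    if n < 2:
        raise ValueError("need at least two log records")
    mean_x = sum(r.input_cardinality for r in logs) / n
    mean_y = sum(r.runtime_ms for r in logs) / n
    var_x = sum((r.input_cardinality - mean_x) ** 2 for r in logs)
    if var_x == 0:
        raise ValueError("all records have the same cardinality")
    cov_xy = sum(
        (r.input_cardinality - mean_x) * (r.runtime_ms - mean_y) for r in logs
    )
    alpha = cov_xy / var_x          # cost per input tuple
    beta = mean_y - alpha * mean_x  # fixed start-up cost
    return alpha, beta


# Made-up log records that lie exactly on runtime = 0.02 * cardinality + 5.
logs = [
    ExecutionLog(1000, 25.0),
    ExecutionLog(2000, 45.0),
    ExecutionLog(4000, 85.0),
]
alpha, beta = calibrate_linear_cost(logs)
```

A learned model such as the one discussed in [1] would replace the linear fit,
but the loop is the same: collect logs, refit, and hand the new parameters to
the optimizer.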

 

*Benefits to Community*

The community will gain an AI pipeline for automatic, dynamic cost model
calibration in query optimizers; we will use Apache Wayang (Incubating) as our
playground. As a result, the experience of Apache Wayang (Incubating) users
will improve, because the pipeline helps them automatically tune their cost
models and adapt to the current query workload.

 

*Deliverables*

The expected deliverable is an adaptation of the paper "Zero-Shot Cost Models
for Out-of-the-box Learned Cost Prediction" [1], where the authors assume an
ML-based cost model. In this case, however, the idea needs modifications to
run in the current setup of Apache Wayang (Incubating).

 

The expected steps are the following:
 * Understand the paper [1]
 * Become familiar with the cost model of Apache Wayang
 * Discuss and design the process for the dynamic cost model
 * Implement the dynamic cost model feature
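
One possible starting point for the design step is a trigger that decides when
recalibration should run. The sketch below flags drift when the median q-error
between estimated and observed costs exceeds a threshold. The function names,
the choice of q-error as the signal, and the threshold of 2.0 are assumptions
made for illustration; they are not part of the existing Apache Wayang
optimizer.

```python
from statistics import median
from typing import Iterable, Tuple


def q_error(estimated: float, observed: float) -> float:
    """Return max(est/obs, obs/est); 1.0 means a perfect estimate."""
    if estimated <= 0 or observed <= 0:
        raise ValueError("costs must be positive")
    return max(estimated / observed, observed / estimated)


def needs_recalibration(
    pairs: Iterable[Tuple[float, float]], threshold: float = 2.0
) -> bool:
    """Flag drift when the median q-error of recent runs exceeds threshold."""
    errors = [q_error(est, obs) for est, obs in pairs]
    if not errors:
        return False  # no evidence yet, keep the current model
    return median(errors) > threshold


# Made-up (estimated, observed) cost pairs from recent executions.
recent = [(10.0, 12.0), (20.0, 70.0), (5.0, 30.0)]
drifted = needs_recalibration(recent)
```

Using the median rather than the mean keeps a single badly estimated query
from triggering a full recalibration.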

 

*Related Work*

[1] Zero-Shot Cost Models for Out-of-the-box Learned Cost Prediction,
https://arxiv.org/pdf/2201.00561.pdf

[2] RHEEMix in the data jungle: a cost-based optimizer for cross-platform
systems, https://wayang.apache.org/assets/pdf/paper/journal_vldb.pdf

 

*Biographical Information of Possible Mentors*

Bertty Contreras-Rojas is a Senior Software Engineer at Databloom Inc. He is
a member of the Apache Wayang (Incubating) PPMC. He has many years of
experience developing data-intensive processing systems for several
industries, such as banking. He was a research engineer at the Qatar Computing
Research Institute, where he was responsible for developing the declarative
query engine for Rheem and adding new underlying platforms to Rheem.

 

Rodrigo Pardo-Meza is a Senior Software Engineer at Databloom Inc. He is a
member of the Apache Wayang (Incubating) PPMC. He has many years of experience
developing applications that support Big Data processing, with experience
implementing ETL processes over distributed systems to optimize inventories in
supply chains. He was a research engineer at the Qatar Computing Research
Institute, where he specialized in human interface interaction with big data
analytics. During this time, he co-developed an ML-based cross-platform query
optimizer.

 

Jorge Quiané is the head of the Big Data Systems research group at the Berlin
Institute for the Foundations of Learning and Data (BIFOLD) and a Principal
Researcher at DIMA (TU Berlin). He also acts as the Scientific Coordinator of
the IAM group at the German Research Center for Artificial Intelligence (DFKI).
His current research is in the broad area of big data, mainly in federated data
analytics, scalable data infrastructures, and distributed query processing. He
has published numerous research papers on data management and novel system
architectures. He has recently been honoured with the 2022 ACM SIGMOD Research
Highlight Award and the Best Paper Award at ICDE 2021 for his work on
“Efficient Control Flow in Dataflow Systems”. He holds five patents in core
database areas and in machine learning. Earlier in his career, he was a Senior
Scientist at the Qatar Computing Research Institute (QCRI) and a Postdoctoral
Researcher at Saarland University. He obtained his PhD in computer science from
INRIA (Nantes University).


> Apache Wayang(Incubating): Cost Model Learner Using Machine learning
> --------------------------------------------------------------------
>
>                 Key: COMDEV-473
>                 URL: https://issues.apache.org/jira/browse/COMDEV-473
>             Project: Community Development
>          Issue Type: New Feature
>          Components: GSoC/Mentoring ideas
>            Reporter: Bertty Contreras
>            Priority: Critical
>              Labels: gsoc, gsoc2022, machine_learning
>   Original Estimate: 5h 50m
>  Remaining Estimate: 5h 50m



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
