[
https://issues.apache.org/jira/browse/COMDEV-476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bertty Contreras updated COMDEV-476:
------------------------------------
Description:
*Synopsis*
The current Apache Wayang (Incubating) approach to communicating between
platforms uses a Channel abstraction. The idea behind a channel is to generate
a source and a sink operator in the platform that enable communication between
two platforms. Yet, the current approach does not handle any bottleneck that
may arise at the target platform (sink operator). For example, when moving data
from Apache Spark to PostgreSQL, we can expect a bottleneck at PostgreSQL if
the incoming data volume is large: a distributed platform is feeding a
single-node platform. This will cause problems on the PostgreSQL side and could
even lead to data loss. We thus aim at powering communication channels with
Apache Kafka support, where a queue can make data movement between platforms
smoother.
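
To illustrate the idea of a queue sitting between a fast source and a slower
sink, here is a minimal sketch in plain Java. It is not Wayang or Kafka code:
the class name BufferedChannelSketch, the method transfer, and the
END_OF_STREAM sentinel are hypothetical names made up for this example, and a
bounded in-memory queue only stands in for a Kafka topic. Kafka behaves
differently in practice (it persists records to a log and consumers pull at
their own pace), so this shows only the decoupling of the two sides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class BufferedChannelSketch {

    // Sentinel marking the end of the stream (a convention of this sketch only).
    private static final Integer END_OF_STREAM = Integer.MIN_VALUE;

    /**
     * Moves recordCount integers from a producer thread (standing in for the
     * source platform) to a consumer thread (standing in for the sink platform)
     * through a bounded queue, and returns what the consumer received, in order.
     */
    public static List<Integer> transfer(int recordCount, int queueCapacity) {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(queueCapacity);
        List<Integer> received = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < recordCount; i++) {
                    queue.put(i); // blocks while the queue is full
                }
                queue.put(END_OF_STREAM);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        Thread consumer = new Thread(() -> {
            try {
                while (true) {
                    Integer record = queue.take(); // blocks while the queue is empty
                    if (record.equals(END_OF_STREAM)) {
                        break;
                    }
                    received.add(record);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });

        producer.start();
        consumer.start();
        try {
            // join() makes the consumer's writes to `received` visible here.
            producer.join();
            consumer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("transfer was interrupted", e);
        }
        return received;
    }

    public static void main(String[] args) {
        List<Integer> out = transfer(1000, 16);
        System.out.println("records received: " + out.size());
    }
}
```

With a capacity of 16, the producer can never run more than 16 records ahead
of the consumer, so the sink is not flooded and no record is dropped. A
Kafka-backed channel would aim for a similar effect across platforms, with the
topic absorbing bursts instead of an in-process queue.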
*Benefits to Community*
The community will benefit from being able to use Apache Kafka as a
communication channel. This should help speed up query processing and give the
optimizer a better setup and more alternatives.
*Deliverables*
The expected deliverable is a robust solution for including Apache Kafka as a
communication Channel.
The expected steps are the following:
* Understand the paper [1]
* Get into the internals of the communication channels of Apache Wayang
(Incubating)
* Discuss and design the solution for the loop optimizations
* Implement and integrate the new Channel communications inside Apache Wayang
(Incubating).
*Related Work*
[1] [RHEEMix in the data jungle: a cost-based optimizer for cross-platform
systems|https://wayang.apache.org/assets/pdf/paper/journal_vldb.pdf]
*Biographical Information of the possible mentors*
Bertty Contreras-Rojas is a Senior Software Engineer at Databloom Inc. He is a
PPMC member of Apache Wayang (Incubating). He has many years of experience
developing data-intensive processing systems for several industries, such as
banking. He was a research engineer at the Qatar Computing Research Institute,
where he was responsible for developing the declarative query engine for Rheem
and adding new underlying platforms to Rheem.
Rodrigo Pardo-Meza is a Senior Software Engineer at Databloom Inc. He is a PPMC
member of Apache Wayang (Incubating). He has many years of experience
developing applications that support Big Data processing, including
implementing ETL processes over distributed systems to optimize inventories in
supply chains. He was a research engineer at the Qatar Computing Research
Institute, where he specialized in human interface interaction with big data
analytics. During this time, he co-developed an ML-based cross-platform query
optimizer.
Jorge Quiané is the head of the Big Data Systems research group at the Berlin
Institute for the Foundations of Learning and Data (BIFOLD) and a Principal
Researcher at DIMA (TU Berlin). He also acts as the Scientific Coordinator of
the IAM group at the German Research Center for Artificial Intelligence (DFKI).
His current research is in the broad area of big data: mainly in federated data
analytics, scalable data infrastructures, and distributed query processing. He
has published numerous research papers on data management and novel system
architectures. He has recently been honoured with the 2022 ACM SIGMOD Research
Highlight Award and the Best Paper Award at ICDE 2021 for his work on
“Efficient Control Flow in Dataflow Systems”. He holds five patents in core
database areas and on machine learning. Earlier in his career, he was a Senior
Scientist at the Qatar Computing Research Institute (QCRI) and a Postdoctoral
Researcher at Saarland University. He obtained his PhD in computer science from
INRIA (Nantes University).
*Name and Contact Information*
Name: Bertty Contreras-Rojas
email: bertty (at) apache.org
community: dev (at) wayang.apache.org
website: [https://wayang.apache.org|https://wayang.apache.org/]
> Apache Wayang(Incubating): New Channel communication using Apache Kafka
> -----------------------------------------------------------------------
>
> Key: COMDEV-476
> URL: https://issues.apache.org/jira/browse/COMDEV-476
> Project: Community Development
> Issue Type: New Feature
> Components: GSoC/Mentoring ideas
> Reporter: Bertty Contreras
> Priority: Critical
> Labels: gsoc, gsoc2022, machine_learning
> Original Estimate: 175h
> Remaining Estimate: 175h