Thanks Jorge, this helps a lot in clarifying the points related to the Genetic Optimizer.
I have a few additional questions on the subject matter:

1. Regarding the platforms: Are you considering adding support for using multiple Spark or Postgres instances simultaneously? I noticed there is a branch on GitHub dedicated to this purpose, specifically implemented for Spark. I'm curious to know whether this is just a proof of concept or something you plan to incorporate in the future.

2. Regarding the operators: In the Postgres platform, I can see the Executor, Filter, Projection, and TableSource operators. Currently, when I read two tables from Postgres and perform a JOIN operation, the JOIN appears to be executed locally within the Wayang environment using the Java Streams platform, rather than directly within Postgres itself. Is this because the Join operator for Postgres has not been implemented yet? Or because, based on the cost functions, it is considered more cost-effective to execute the JOIN locally? Or am I missing something?

3. Regarding the cost functions: To clarify some points related to Section 4 of the paper: do you take the cost of moving data between platforms into account by default? Is this cost captured in the conversion operators, such as the SqlToStreamOperator? If so, should I add a custom cost-function template under the "network" key of wayang.postgres.sqltostream.load.output.template to account for this data movement? Or is the data-transfer cost between platforms considered somewhere else, meaning I should handle it differently?

Thanks & best regards,
Josep

From: Jorge Arnulfo Quiané Ruiz <[email protected]>
Date: Friday, 26 May 2023 at 11:55
To: [email protected] <[email protected]>
Subject: [EXTERNAL] Re: Profiler and Cost Functions

Hello Josep,

Replying with a bit of delay because I have been travelling this week :)

Regarding your second point, we basically have two ways of learning the cost parameters of the execution operators: by analysing execution logs (using the genetic optimizer) or by profiling individual operators. The package you refer to is for the latter (profiling individual execution operators). This was our original idea for obtaining the cost parameters, but we quickly found out that the results would be very far off from the real costs, because most big data platforms exploit operator pipelining, which makes it hard to profile operators individually. So you cannot use the output of this individual profiler for the genetic algorithm.

Let us now discuss your first point, which is about the Genetic Optimizer. This was our solution to the problems of the individual-operator profiling approach. The genetic optimizer instead tries to obtain the operator costs by analysing execution logs. For this, it requires both a cost-function template per execution operator (specified in JSON format: https://github.com/apache/incubator-wayang/blob/80170b543469172438bb603dd6b5fbb2bd5dae64/wayang-platforms/wayang-spark/code/main/resources/wayang-spark-defaults.properties) and Wayang execution logs (i.e., logs from running jobs via Wayang). The genetic optimizer will learn the coefficients (denoted by "?" in the template functions).
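For illustration, such a template entry has roughly the following shape (this is a sketch only; the operator name and key names here are illustrative, so please check the linked properties file for the exact names and grammar):

    # Illustrative cost-function template for a Spark operator (sketch only;
    # see the linked wayang-spark-defaults.properties for the exact grammar).
    # in0/out0 denote the operator's input/output cardinalities, and each "?"
    # is a coefficient that the genetic optimizer learns from execution logs.
    wayang.spark.map.load.template = {"in":1, "out":1, "cpu":"?*in0 + ?*out0 + ?"}

The optimizer then fits these "?" coefficients so that the costs predicted by the templates match the runtimes observed in the logs.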
To actually understand how it does so, our VLDBJ paper (also on arXiv) gives a bit more detail and a pointer to the genetic optimization we use: https://arxiv.org/pdf/1805.03533.pdf (Section 3.2 and Figure 4).

Let us know if that helps.

Best,
Jorge

> On 24 May 2023, at 11.12, Josep Sampe Domenech <[email protected]> wrote:
>
> Hello dev,
>
> We recently started our exploration of the Wayang project and we would like to gain a deeper understanding of the profiler tool and its functionalities, specifically the collection and use of metrics.
>
> To enhance our comprehension, we would appreciate your assistance in addressing the following queries:
>
> 1. Could you please explain how the GeneticOptimizerApp works? Specifically, we would like to understand which information from the executions.json file is taken into consideration when calculating the "?" parameters in the cost functions. Additionally, we are interested in learning more about the methodology employed to calculate the "?" values.
>
> 2. We are also curious about the profiler.spark package. What is its purpose? Does it serve a specific objective, and can the results obtained from it be utilized or integrated into the GeneticOptimizerApp?
>
> Thank you in advance for your time and attention. We look forward to your response.
>
> Best regards,
>
> Josep
