Dear Josep,

Let me try to answer these. Please see my responses inline below.

On Wed, May 31, 2023 at 4:37 PM Josep Sampe Domenech <[email protected]> wrote:

> Thanks Jorge, this helps a lot to clarify the points related to the
> Genetic Optimizer.
>
> I have a few additional questions on the subject matter:
>
> 1. Regarding the platforms: Do you consider adding support for using
>    multiple Spark or Postgres instances simultaneously? I noticed there is
>    a branch on GitHub dedicated to this purpose, specifically implemented
>    for Spark. I'm curious to know if this is just a proof of concept or if
>    it's something you plan to incorporate in the future.

In theory, Wayang can support multiple instances of the same platform.
However, this would require a unique identifier for each platform instance
and some follow-up changes. This is very much in our plans for the near
future.

> 2. Regarding the operators: In the Postgres platform, I can see the
>    Executor, Filter, Projection, and TableSource operators. Currently, when
>    I read two tables from Postgres and perform a JOIN operation, it appears
>    that the JOIN is executed locally within the Wayang environment using
>    the Java streams platform, rather than running the JOIN operation
>    directly within Postgres itself. Is it because the Join operator in
>    Postgres has not been implemented yet? Or is it because, based on the
>    cost functions, it is considered more cost-effective to execute the JOIN
>    locally? Or am I missing something?

In this case, the join operator is not yet implemented. We are in the
process of supporting join pushdown as part of the Wayang SQL API.

> 3. Regarding the cost functions: To clarify some things related to
>    Section 4 of the paper: are you considering by default the cost of
>    moving data between platforms? Is the cost of moving data between
>    platforms taken into account in the conversion operators, like the
>    SqlToStreamOperator?
>    If so, should I add a custom cost-function template in the “network”
>    key of the wayang.postgres.sqltostream.load.output.template to take
>    this data movement into account? Or is the data transfer cost between
>    platforms considered in a different place, so that I should do it in a
>    different way?

I am not 100% sure about this, but
https://github.com/apache/incubator-wayang/blob/80170b543469172438bb603dd6b5fbb2bd5dae64/wayang-commons/wayang-core/src/main/java/org/apache/wayang/core/optimizer/channels/DefaultChannelConversion.java#L181
could be a pointer.

Best,
Kaustubh

> Thanks & best regards,
> Josep
>
> From: Jorge Arnulfo Quiané Ruiz <[email protected]>
> Date: Friday, 26 May 2023 at 11:55
> To: [email protected] <[email protected]>
> Subject: [EXTERNAL] Re: Profiler and Cost Functions
>
> Hello Josep,
>
> Replying with a bit of delay because I have been travelling this week :)
>
> Regarding your second point, we basically have two ways of learning the
> cost parameters of the execution operators: by analysing execution logs
> (using the genetic optimizer) or by profiling individual operators. The
> package you refer to is for the latter (profiling individual execution
> operators). This was our original idea for getting the cost parameters,
> but we quickly found out that the results would be very far from the real
> costs, because most big data platforms exploit operator pipelining, which
> makes it hard to profile operators individually. So, you cannot use the
> output of this individual profiler for the genetic algorithm.
>
> So, let us now discuss your first point, which is about the Genetic
> Optimizer. This was our solution to the problem with the individual
> operator profiling approach. The genetic optimizer, instead, tries to get
> the operator costs by analysing execution logs.
> For this, it requires both a cost function template per execution
> operator (which should be specified in a JSON format:
> https://github.com/apache/incubator-wayang/blob/80170b543469172438bb603dd6b5fbb2bd5dae64/wayang-platforms/wayang-spark/code/main/resources/wayang-spark-defaults.properties
> ) and Wayang execution logs (i.e. running jobs via Wayang). The genetic
> optimizer will learn the coefficients (denoted by ? in the template
> function). To actually understand how it does so, our VLDBJ paper (also on
> arXiv) gives a few more details and a pointer for the genetic optimization
> we use: https://arxiv.org/pdf/1805.03533.pdf (Section 3.2 and Figure 4).
>
> Let us know if that helps.
>
> Best,
> Jorge
>
> > On 24 May 2023, at 11.12, Josep Sampe Domenech <[email protected]> wrote:
> >
> > Hello dev,
> >
> > We recently started our exploration of the Wayang project and we would
> > like to gain a deeper understanding of the profiler tool and its
> > functionalities, specifically about the collection and use of metrics.
> >
> > To enhance our comprehension, we would appreciate your assistance in
> > addressing the following queries:
> >
> > 1. Could you please provide us with an explanation of how the
> >    GeneticOptimizerApp works? Specifically, we would like to understand
> >    which information from the executions.json file is taken into
> >    consideration when calculating the "?" parameters in the cost
> >    functions. Additionally, we are interested in learning more about the
> >    methodology employed to calculate the "?" values.
> >
> > 2. We are also curious about the purpose of the profiler.spark package.
> >    What is the purpose of this package? Does it serve a specific
> >    objective? And can the results obtained from profiler.spark be
> >    utilized or integrated into the GeneticOptimizerApp?
> >
> > Thank you in advance for your time and attention. We look forward to
> > your response.
> >
> > Best regards,
> > Josep
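
To make the coefficient learning described above more concrete, here is a minimal sketch of fitting the "?" placeholders of a cost template to logged runtimes with a small genetic search. Everything in it is an assumption for illustration: the linear template "?*n + ?", the synthetic log entries, the mean-relative-error fitness, and all function names are hypothetical and are not taken from Wayang's GeneticOptimizerApp, whose actual encoding and fitness function may differ.

```python
import random

# Hypothetical execution log: (input cardinality, measured runtime in ms).
# The runtimes are synthetic, generated from 0.002 * n + 50 for illustration.
LOG = [(n, 0.002 * n + 50.0) for n in (1_000, 10_000, 100_000, 1_000_000)]


def estimate(coeffs, n):
    """Assumed cost template '?*n + ?' with both '?' filled in by coeffs."""
    a, b = coeffs
    return a * n + b


def loss(coeffs):
    """Mean relative error between estimated and logged runtimes."""
    return sum(abs(estimate(coeffs, n) - t) / t for n, t in LOG) / len(LOG)


def evolve(generations=200, pop_size=40, seed=0):
    rng = random.Random(seed)
    # Random initial population of candidate coefficient pairs.
    pop = [(rng.uniform(0, 0.01), rng.uniform(0, 100)) for _ in range(pop_size)]
    initial_best = min(loss(c) for c in pop)
    for _ in range(generations):
        pop.sort(key=loss)
        survivors = pop[: pop_size // 4]  # elitism: keep the fittest quarter
        children = []
        while len(survivors) + len(children) < pop_size:
            p, q = rng.sample(survivors, 2)
            # Crossover: pick each gene from one parent.
            # Mutation: scale the gene by Gaussian noise around 1.0.
            child = tuple(
                rng.choice(pair) * rng.gauss(1.0, 0.05) for pair in zip(p, q)
            )
            children.append(child)
        pop = survivors + children
    best = min(pop, key=loss)
    return best, initial_best


best, initial_best = evolve()
print("learned coefficients:", best)
print("loss:", loss(best), "initial best loss:", initial_best)
```

Because the fittest candidates always survive into the next generation, the best loss never gets worse over time. With real logs, each entry would come from a full job execution rather than from a single operator, which is the reason given above for preferring this approach over per-operator profiling.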
