Thanks Jorge, this helps a lot in clarifying the points related to the Genetic Optimizer.
I have a few additional questions on the subject matter:

1. Regarding the platforms: Are you considering adding support for using multiple Spark or Postgres instances simultaneously? I noticed there is a branch on GitHub dedicated to this purpose, specifically implemented for Spark. I'm curious to know whether this is just a proof of concept or something you plan to incorporate in the future.

2. Regarding the operators: In the Postgres platform, I can see the Executor, Filter, Projection, and TableSource operators. Currently, when I read two tables from Postgres and perform a JOIN operation, the JOIN appears to be executed locally within the Wayang environment using the Java Streams platform, rather than directly within Postgres itself. Is this because the Join operator for Postgres has not been implemented yet? Or because, based on the cost functions, it is considered more cost-effective to execute the JOIN locally? Or am I missing something?

3. Regarding the cost functions: To clarify some points related to Section 4 of the paper: do you take the cost of moving data between platforms into account by default? Is this cost captured in the conversion operators, such as the SqlToStreamOperator? If so, should I add a custom cost-function template under the "network" key of wayang.postgres.sqltostream.load.output.template to account for this data movement? Or is the data-transfer cost between platforms considered somewhere else, meaning I should handle it differently?

Thanks & best regards,
Josep

From: Jorge Arnulfo Quiané Ruiz <[email protected]>
Date: Friday, 26 May 2023 at 11:55
To: [email protected] <[email protected]>
Subject: [EXTERNAL] Re: Profiler and Cost Functions

Hello Josep,

Replying with a bit of delay because I have been travelling this week :)

Regarding your second point, we basically have two ways of learning the cost parameters of the execution operators: by analysing execution logs (using the genetic optimizer) or by profiling individual operators. The package you refer to is for the latter (profiling individual execution operators). This was our original idea for obtaining the cost parameters, but we quickly found out that the results would be very far off from the real costs, because most big data platforms exploit operator pipelining, which makes it hard to profile operators individually. So you cannot use the output of this individual profiler for the genetic algorithm.

Let us now discuss your first point, which is about the Genetic Optimizer. This was our solution to the problems of the individual-operator profiling approach. The genetic optimizer instead tries to obtain the operator costs by analysing execution logs. For this, it requires both a cost-function template per execution operator (specified in JSON format: https://github.com/apache/incubator-wayang/blob/80170b543469172438bb603dd6b5fbb2bd5dae64/wayang-platforms/wayang-spark/code/main/resources/wayang-spark-defaults.properties) and Wayang execution logs (i.e., logs from running jobs via Wayang). The genetic optimizer will learn the coefficients (denoted by "?" in the template functions).
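For illustration, such a template entry has roughly the following shape (this is a sketch only; the operator name and key names here are illustrative, so please check the linked properties file for the exact names and grammar):

    # Illustrative cost-function template for a Spark operator (sketch only;
    # see the linked wayang-spark-defaults.properties for the exact grammar).
    # in0/out0 denote the operator's input/output cardinalities, and each "?"
    # is a coefficient that the genetic optimizer learns from execution logs.
    wayang.spark.map.load.template = {"in":1, "out":1, "cpu":"?*in0 + ?*out0 + ?"}

The optimizer then fits these "?" coefficients so that the costs predicted by the templates match the runtimes observed in the logs.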
To actually understand how it does so, our VLDBJ paper (also on arXiv) gives a bit more detail and a pointer to the genetic optimization we use: https://arxiv.org/pdf/1805.03533.pdf (Section 3.2 and Figure 4).

Let us know if that helps.

Best,
Jorge

> On 24 May 2023, at 11.12, Josep Sampe Domenech <[email protected]> wrote:
>
> Hello dev,
>
> We recently started our exploration of the Wayang project and we would like to gain a deeper understanding of the profiler tool and its functionalities, specifically the collection and use of metrics.
>
> To enhance our comprehension, we would appreciate your assistance in addressing the following queries:
>
> 1. Could you please explain how the GeneticOptimizerApp works? Specifically, we would like to understand which information from the executions.json file is taken into consideration when calculating the "?" parameters in the cost functions. Additionally, we are interested in learning more about the methodology employed to calculate the "?" values.
>
> 2. We are also curious about the profiler.spark package. What is its purpose? Does it serve a specific objective, and can the results obtained from it be utilized or integrated into the GeneticOptimizerApp?
>
> Thank you in advance for your time and attention. We look forward to your response.
>
> Best regards,
>
> Josep
