I can understand that if you involve columns with variable distribution in join 
operations, it may change your execution plan, but most of the time this is not 
going to happen. In streaming, the most used operations are map, filter, 
grouping and stateful operations, and in all these cases I can't see how 
dynamic query planning could help.

It could be useful to have a parameter to force a streaming query to calculate 
the query plan just once.

Paolo




________________________________
From: Alessandro Solimando <alessandro.solima...@gmail.com>
Sent: Thursday, March 14, 2019 6:59:50 PM
To: Paolo Platter
Cc: user@spark.apache.org
Subject: Re: Structured Streaming & Query Planning

Hello Paolo,
generally speaking, query planning is mostly based on statistics and 
distributions of data values for the involved columns, which might 
significantly change over time in a streaming context, so for me it makes a lot 
of sense that it is run at every schedule, even though I understand your 
concern.

For the second question I don't know how to (or if you even can) cache the 
computed query plan.

If possible, would you mind sharing your findings afterwards? (Query planning 
on streaming is a very interesting and not yet sufficiently explored topic, IMO.)

Best regards,
Alessandro

On Thu, 14 Mar 2019 at 16:51, Paolo Platter 
<paolo.plat...@agilelab.it> wrote:
Hi All,

I would like to understand why, in a streaming query (which should not be able 
to change its behaviour across iterations), there is a queryPlanning-Duration 
effort (in my case 33% of the trigger interval) at every schedule. I don't 
understand why this is needed and whether it is possible to disable or cache it.
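
For anyone who wants to check the same ratio on their own query, here is a minimal sketch in Python. It assumes a progress report shaped like the dict that PySpark's StreamingQuery.lastProgress returns, with a "durationMs" map containing "queryPlanning" and "triggerExecution". The helper name planning_share and the sample numbers are made up for illustration. Note that it compares planning time against the measured trigger execution time, which is not necessarily the configured trigger interval.

```python
from typing import Any, Dict


def planning_share(progress: Dict[str, Any]) -> float:
    """Return queryPlanning time as a fraction of triggerExecution time.

    Returns 0.0 when the durations are missing or the total is not positive.
    """
    durations = progress.get("durationMs", {})
    total = durations.get("triggerExecution", 0)
    if total <= 0:
        return 0.0
    return durations.get("queryPlanning", 0) / total


# Hypothetical progress report; the values are invented, not measured.
sample_progress = {
    "batchId": 42,
    "durationMs": {
        "addBatch": 600,
        "getBatch": 20,
        "queryPlanning": 330,
        "triggerExecution": 1000,
    },
}

share = planning_share(sample_progress)
print(f"queryPlanning share: {share:.0%}")
```

With a running query, the same helper could be called as planning_share(query.lastProgress) once at least one batch has completed, since lastProgress is None before that.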

Thanks



Paolo Platter
CTO
E-mail:     paolo.plat...@agilelab.it
Web Site:   http://www.agilelab.it/


