Re: Structured Streaming & Query Planning

Jungtaek Lim Mon, 18 Mar 2019 00:40:03 -0700

Almost everything is coupled with the logical plan right now, including the
updated range for each source in a new batch, the updated watermark for
stateful operations, and the random seed for each batch. Please refer to the
code below:


https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/IncrementalExecution.scala

We might try moving these things into the physical plan so that the logical
plan doesn't need to be re-evaluated on every batch, but I'm not sure it's
feasible.
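
To make the coupling concrete, here is a toy sketch in Python. It is not
Spark code: BatchContext, LogicalPlan, PhysicalPlanTemplate and
plan_for_batch are hypothetical names invented for illustration. It only
shows why baking per-batch values into the plan yields a different plan on
every batch, and what binding those values at execution time might look
like.

```python
from dataclasses import dataclass
from typing import Tuple

# Toy model only: every class and function here is hypothetical, NOT Spark's API.


@dataclass(frozen=True)
class BatchContext:
    """Values that change on every micro-batch."""
    offset_range: Tuple[int, int]  # new data range for the source
    watermark_ms: int              # watermark used by stateful operations
    seed: int                      # per-batch random seed


@dataclass(frozen=True)
class LogicalPlan:
    """A logical plan with the per-batch values baked into it."""
    source: str
    offset_range: Tuple[int, int]
    watermark_ms: int
    seed: int


def plan_for_batch(source: str, ctx: BatchContext) -> LogicalPlan:
    # Coupled design: the batch values are part of the plan itself,
    # so each batch produces a new plan that must be planned again.
    return LogicalPlan(source, ctx.offset_range, ctx.watermark_ms, ctx.seed)


@dataclass(frozen=True)
class PhysicalPlanTemplate:
    """Assumed alternative: plan once, bind batch values at execution time."""
    source: str

    def describe(self, ctx: BatchContext) -> str:
        lo, hi = ctx.offset_range
        return (f"scan {self.source}[{lo}, {hi}) "
                f"watermark={ctx.watermark_ms} seed={ctx.seed}")


batch0 = BatchContext((0, 100), watermark_ms=0, seed=1)
batch1 = BatchContext((100, 250), watermark_ms=5000, seed=2)

# With baked-in values the two plans are not equal, so a cached plan
# from batch 0 cannot simply be reused for batch 1.
plans_differ = plan_for_batch("events", batch0) != plan_for_batch("events", batch1)

# With late binding, a single template object serves both batches.
template = PhysicalPlanTemplate("events")
print(plans_differ)
print(template.describe(batch0))
print(template.describe(batch1))
```

Whether the real operators can be restructured this way is exactly the open
question; the sketch leaves out everything that makes it hard in practice.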

Thanks,
Jungtaek Lim (HeartSaVioR)

On Mon, 18 Mar 2019 at 4:03 PM, Paolo Platter <paolo.plat...@agilelab.it> wrote:

> I can understand that if you involve columns with a variable distribution in
> join operations, it may change your execution plan, but most of the time
> this is not going to happen. In streaming, the most used operations are map,
> filter, grouping, and stateful operations, and in all these cases I can't see
> how dynamic query planning could help.
>
> It could be useful to have a parameter to force a streaming query to
> calculate the query plan just once.
>
> Paolo
>
>
> ------------------------------
> *From:* Alessandro Solimando <alessandro.solima...@gmail.com>
> *Sent:* Thursday, March 14, 2019 6:59:50 PM
> *To:* Paolo Platter
> *Cc:* user@spark.apache.org
> *Subject:* Re: Structured Streaming & Query Planning
>
> Hello Paolo,
> generally speaking, query planning is mostly based on statistics and
> distributions of data values for the involved columns, which might
> significantly change over time in a streaming context, so for me it makes a
> lot of sense that it is run at every schedule, even though I understand
> your concern.
>
> For the second question I don't know how to (or if you even can) cache the
> computed query plan.
>
> If possible, would you mind sharing your findings afterwards? (Query
> planning on streaming is a very interesting and not yet sufficiently
> explored topic, IMO.)
>
> Best regards,
> Alessandro
>
> On Thu, 14 Mar 2019 at 16:51, Paolo Platter <paolo.plat...@agilelab.it>
> wrote:
>
>> Hi All,
>>
>>
>>
>> I would like to understand why, in a streaming query (which should not be
>> able to change its behaviour across iterations), there is a
>> queryPlanning duration cost (in my case, 33% of the trigger interval) at
>> every schedule. I don't understand why this is needed or whether it is
>> possible to disable or cache it.
>>
>>
>>
>> Thanks
>>
>> *Paolo Platter*
>>
>> *CTO*
>>
>> E-mail:        paolo.plat...@agilelab.it
>>
>> Web Site:   www.agilelab.it
>>
>>
>>
>>
>>
>
