I agree that the long-term solution is much farther away, but I'm not sure it is a good idea to do this in the optimizer. Maybe we could find a good way to do it, but the extra complexity the optimizer approach required, before we moved push-down to the conversion to physical plan, was really bad. Plus, this has been outstanding for probably a year now, so I am not confident that the long-term solution would be a priority -- it seems to me that band-aid solutions persist for far too long.
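To make the ordering problem concrete, here is a toy, self-contained model of why it matters whether push-down runs before or after the optimizer's stats-based decisions. This is not Spark's actual API; all names and numbers here are made up for illustration:

```scala
// Toy model (hypothetical, not Spark's API) of the stats/push-down ordering
// problem: the optimizer decides what to broadcast based on estimated size,
// but a filter pushed into the scan can shrink that estimate dramatically.

case class Stats(sizeInBytes: Long)

// A "scan" whose size estimate shrinks once a filter has been pushed into it.
case class Scan(baseSize: Long, filterPushed: Boolean = false) {
  def pushFilter: Scan = copy(filterPushed = true)
  def stats: Stats = Stats(if (filterPushed) baseSize / 10 else baseSize)
}

// Broadcast anything estimated at 10 MB or less (an arbitrary threshold,
// in the spirit of spark.sql.autoBroadcastJoinThreshold).
val broadcastThreshold = 10L * 1024 * 1024

def shouldBroadcast(s: Scan): Boolean =
  s.stats.sizeInBytes <= broadcastThreshold

val scan = Scan(baseSize = 50L * 1024 * 1024) // 50 MB unfiltered

// If push-down only happens while converting to the physical plan, the
// optimizer sees the pre-push-down estimate and skips the broadcast:
assert(!shouldBroadcast(scan))

// If push-down runs before the optimizer uses stats, the post-filter
// estimate (5 MB here) makes the same side broadcastable:
assert(shouldBroadcast(scan.pushFilter))
```

This is the trade-off in the thread: running push-down in the optimizer (or twice) gets the cheaper plan, at the cost of complexity in the wrong layer.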
On Tue, Jul 23, 2019 at 4:30 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi Ryan,
>
> Thanks for summarizing and sending out the meeting notes! Unfortunately, I
> missed the last sync, but the topics are really interesting, especially the
> stats integration.
>
> The ideal solution I can think of is to refactor the optimizer/planner and
> move all the stats-based optimization to the physical plan phase (or do it
> during planning). This needs a lot of design work and I'm not sure we can
> finish it in the near future.
>
> Alternatively, we can do the operator push-down at the logical plan phase
> via the optimizer. This is not ideal, but I think it is a better workaround
> than doing push-down twice. The Parquet nested column pruning is also done
> at the logical plan phase, so I think there are no serious problems with
> doing operator push-down at the logical plan phase.
>
> This is only about the internal implementation, so we can fix it at any
> time. But it may hurt data source v2 performance a lot, and we'd better fix
> it sooner rather than later.
>
>
> On Sat, Jul 20, 2019 at 8:20 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Here are my notes from the last sync. If you'd like to be added to the
>> invite or have topics, please let me know.
>>
>> *Attendees*:
>>
>> Ryan Blue
>> Matt Cheah
>> Yifei Huang
>> Jose Torres
>> Burak Yavuz
>> Gengliang Wang
>> Michael Artz
>> Russel Spitzer
>>
>> *Topics*:
>>
>>    - Existing PRs
>>       - V2 session catalog: https://github.com/apache/spark/pull/24768
>>       - REPLACE and RTAS: https://github.com/apache/spark/pull/24798
>>       - DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
>>       - ALTER TABLE: https://github.com/apache/spark/pull/24937
>>       - INSERT INTO: https://github.com/apache/spark/pull/24832
>>    - Stats integration
>>    - CTAS and DataFrameWriter behavior
>>
>> *Discussion*:
>>
>>    - ALTER TABLE PR is ready to commit (and was committed after the sync)
>>    - REPLACE and RTAS PR: waiting on more reviews
>>    - INSERT INTO PR: Ryan will review
>>    - DESCRIBE TABLE has test failures; Matt will fix them
>>    - V2 session catalog:
>>       - How will the v2 catalog be configured?
>>       - Ryan: This is up for discussion because it currently uses a
>>       table property. I think it needs to be configurable.
>>       - Burak: Agree that it should be configurable.
>>       - Ryan: Does this need to be determined now, or can we solve it
>>       after getting the functionality in?
>>       - Jose: let's get it in and fix it later.
>>    - Stats integration:
>>       - Matt: Has anyone looked at stats integration? What needs to be
>>       done?
>>       - Ryan: Stats are part of the Scan API. Configure a scan with
>>       ScanBuilder and then get stats from it. The problem is that this
>>       happens when converting to the physical plan, after the optimizer.
>>       But the optimizer determines what gets broadcasted. A work-around
>>       Netflix uses is to run push-down in the stats code. This runs
>>       push-down twice and was rejected from Spark, but it is important
>>       for performance. We should add a property to enable it.
>>       - Ryan: The larger problem is that stats are used in the
>>       optimizer, but push-down happens when converting to the physical
>>       plan. This is also related to our earlier discussions about when
>>       join types are chosen. Fixing this is a big project.
>>    - CTAS and DataFrameWriter behavior:
>>       - Burak: DataFrameWriter uses CTAS where it shouldn't. It is
>>       difficult to predict v1 behavior.
>>       - Ryan: Agree, v1 DataFrameWriter does not have clear behavior.
>>       We suggest a replacement with clear verbs for each SQL action:
>>       append/insert, overwrite, overwriteDynamic, create (table),
>>       replace (table).
>>       - Ryan: Prototype available here:
>>       https://gist.github.com/rdblue/6bc140a575fdf266beb2710ad9dbed8f
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>

-- 
Ryan Blue
Software Engineer
Netflix
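For readers following along: the "clear verbs" idea from the notes can be sketched as a small builder where each method maps to exactly one SQL action. This is a hypothetical, self-contained illustration, not the actual prototype in the linked gist; the `WriteBuilder` class and its string output are made up here:

```scala
// Hypothetical sketch of a writer with one explicit verb per SQL action,
// in the spirit of the replacement discussed above (not Spark's real API).

sealed trait WriteAction
case object Append extends WriteAction            // INSERT INTO
case object Overwrite extends WriteAction         // INSERT OVERWRITE
case object OverwriteDynamic extends WriteAction  // dynamic partition overwrite
case object Create extends WriteAction            // CREATE TABLE AS SELECT
case object Replace extends WriteAction           // REPLACE TABLE AS SELECT

class WriteBuilder(table: String) {
  // Returns a description string instead of running anything, for illustration.
  private def run(action: WriteAction): String = s"$action -> $table"

  def append(): String           = run(Append)
  def overwrite(): String        = run(Overwrite)
  def overwriteDynamic(): String = run(OverwriteDynamic)
  def create(): String           = run(Create)
  def replace(): String          = run(Replace)
}

// Each call site states its intent; there is no mode + saveAsTable
// combination whose meaning (CTAS vs. insert) has to be inferred.
val result = new WriteBuilder("db.events").create()
```

The point of the design is that the verb is chosen by the caller, so behavior no longer depends on whether the table already exists or which save mode happens to be set.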