I agree that the long-term solution is much farther away, but I'm not sure it is a good idea to do this in the optimizer. Maybe we could find a good way to do it, but the extra complexity the optimizer approach required, before we moved push-down to the conversion to physical plan, was really bad. Plus, this has been outstanding for probably a year now, so I am not confident that the long-term solution would be a priority -- it seems to me that band-aid solutions persist for far too long.
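To make the ordering problem concrete, here is a toy, self-contained model of why it matters whether push-down runs before or after the optimizer's stats-based decisions. This is not Spark's actual API; all names and numbers here are made up for illustration:

```scala
// Toy model (hypothetical, not Spark's API) of the stats/push-down ordering
// problem: the optimizer decides what to broadcast based on estimated size,
// but a filter pushed into the scan can shrink that estimate dramatically.

case class Stats(sizeInBytes: Long)

// A "scan" whose size estimate shrinks once a filter has been pushed into it.
case class Scan(baseSize: Long, filterPushed: Boolean = false) {
  def pushFilter: Scan = copy(filterPushed = true)
  def stats: Stats = Stats(if (filterPushed) baseSize / 10 else baseSize)
}

// Broadcast anything estimated at 10 MB or less (an arbitrary threshold,
// in the spirit of spark.sql.autoBroadcastJoinThreshold).
val broadcastThreshold = 10L * 1024 * 1024

def shouldBroadcast(s: Scan): Boolean =
  s.stats.sizeInBytes <= broadcastThreshold

val scan = Scan(baseSize = 50L * 1024 * 1024) // 50 MB unfiltered

// If push-down only happens while converting to the physical plan, the
// optimizer sees the pre-push-down estimate and skips the broadcast:
assert(!shouldBroadcast(scan))

// If push-down runs before the optimizer uses stats, the post-filter
// estimate (5 MB here) makes the same side broadcastable:
assert(shouldBroadcast(scan.pushFilter))
```

This is the trade-off in the thread: running push-down in the optimizer (or twice) gets the cheaper plan, at the cost of complexity in the wrong layer.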
On Tue, Jul 23, 2019 at 4:30 AM Wenchen Fan <cloud0...@gmail.com> wrote:

> Hi Ryan,
>
> Thanks for summarizing and sending out the meeting notes! Unfortunately, I
> missed the last sync, but the topics are really interesting, especially the
> stats integration.
>
> The ideal solution I can think of is to refactor the optimizer/planner and
> move all the stats-based optimization to the physical plan phase (or do it
> during planning). This needs a lot of design work and I'm not sure we can
> finish it in the near future.
>
> Alternatively, we can do the operator push-down at the logical plan phase
> via the optimizer. This is not ideal, but I think it is a better workaround
> than doing push-down twice. The Parquet nested column pruning is also done
> at the logical plan phase, so I think there are no serious problems with
> doing operator push-down at the logical plan phase.
>
> This is only about the internal implementation, so we can fix it at any
> time. But it may hurt data source v2 performance a lot, and we'd better fix
> it sooner rather than later.
>
>
> On Sat, Jul 20, 2019 at 8:20 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Here are my notes from the last sync. If you'd like to be added to the
>> invite or have topics, please let me know.
>>
>> *Attendees*:
>>
>> Ryan Blue
>> Matt Cheah
>> Yifei Huang
>> Jose Torres
>> Burak Yavuz
>> Gengliang Wang
>> Michael Artz
>> Russel Spitzer
>>
>> *Topics*:
>>
>>    - Existing PRs
>>       - V2 session catalog: https://github.com/apache/spark/pull/24768
>>       - REPLACE and RTAS: https://github.com/apache/spark/pull/24798
>>       - DESCRIBE TABLE: https://github.com/apache/spark/pull/25040
>>       - ALTER TABLE: https://github.com/apache/spark/pull/24937
>>       - INSERT INTO: https://github.com/apache/spark/pull/24832
>>    - Stats integration
>>    - CTAS and DataFrameWriter behavior
>>
>> *Discussion*:
>>
>>    - ALTER TABLE PR is ready to commit (and was committed after the sync)
>>    - REPLACE and RTAS PR: waiting on more reviews
>>    - INSERT INTO PR: Ryan will review
>>    - DESCRIBE TABLE has test failures; Matt will fix them
>>    - V2 session catalog:
>>       - How will the v2 catalog be configured?
>>       - Ryan: This is up for discussion because it currently uses a
>>       table property. I think it needs to be configurable.
>>       - Burak: Agree that it should be configurable.
>>       - Ryan: Does this need to be determined now, or can we solve it
>>       after getting the functionality in?
>>       - Jose: let's get it in and fix it later.
>>    - Stats integration:
>>       - Matt: Has anyone looked at stats integration? What needs to be
>>       done?
>>       - Ryan: Stats are part of the Scan API. Configure a scan with
>>       ScanBuilder and then get stats from it. The problem is that this
>>       happens when converting to the physical plan, after the optimizer.
>>       But the optimizer determines what gets broadcasted. A work-around
>>       Netflix uses is to run push-down in the stats code. This runs
>>       push-down twice and was rejected from Spark, but it is important
>>       for performance. We should add a property to enable it.
>>       - Ryan: The larger problem is that stats are used in the
>>       optimizer, but push-down happens when converting to the physical
>>       plan. This is also related to our earlier discussions about when
>>       join types are chosen. Fixing this is a big project.
>>    - CTAS and DataFrameWriter behavior:
>>       - Burak: DataFrameWriter uses CTAS where it shouldn't. It is
>>       difficult to predict v1 behavior.
>>       - Ryan: Agree, v1 DataFrameWriter does not have clear behavior.
>>       We suggest a replacement with clear verbs for each SQL action:
>>       append/insert, overwrite, overwriteDynamic, create (table),
>>       replace (table).
>>       - Ryan: Prototype available here:
>>       https://gist.github.com/rdblue/6bc140a575fdf266beb2710ad9dbed8f
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>

-- 
Ryan Blue
Software Engineer
Netflix
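For readers following along: the "clear verbs" idea from the notes can be sketched as a small builder where each method maps to exactly one SQL action. This is a hypothetical, self-contained illustration, not the actual prototype in the linked gist; the `WriteBuilder` class and its string output are made up here:

```scala
// Hypothetical sketch of a writer with one explicit verb per SQL action,
// in the spirit of the replacement discussed above (not Spark's real API).

sealed trait WriteAction
case object Append extends WriteAction            // INSERT INTO
case object Overwrite extends WriteAction         // INSERT OVERWRITE
case object OverwriteDynamic extends WriteAction  // dynamic partition overwrite
case object Create extends WriteAction            // CREATE TABLE AS SELECT
case object Replace extends WriteAction           // REPLACE TABLE AS SELECT

class WriteBuilder(table: String) {
  // Returns a description string instead of running anything, for illustration.
  private def run(action: WriteAction): String = s"$action -> $table"

  def append(): String           = run(Append)
  def overwrite(): String        = run(Overwrite)
  def overwriteDynamic(): String = run(OverwriteDynamic)
  def create(): String           = run(Create)
  def replace(): String          = run(Replace)
}

// Each call site states its intent; there is no mode + saveAsTable
// combination whose meaning (CTAS vs. insert) has to be inferred.
val result = new WriteBuilder("db.events").create()
```

The point of the design is that the verb is chosen by the caller, so behavior no longer depends on whether the table already exists or which save mode happens to be set.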