Sorry these notes are so late; I didn't get to the write-up until now. As usual, if anyone has corrections or comments, please reply.
*Attendees*:
John Zhuge
Ryan Blue
Andrew Long
Wenchen Fan
Gengliang Wang
Russell Spitzer
Yuanjian Li
Yifei Huang
Matt Cheah
Amardeep Singh Dhilon
Zhilmil Dhion
Ryan Pifer

*Topics*:
- Should Spark require catalogs to report case sensitivity?
- Bucketing and sorting survey
- Add default v2 catalog: https://github.com/apache/spark/pull/24594
- SupportsNamespaces API: https://github.com/apache/spark/pull/24560
- FunctionCatalog API: https://github.com/apache/spark/pull/24559
- Skip output column resolution: https://github.com/apache/spark/pull/24469
- Move DSv2 into catalyst module: https://github.com/apache/spark/pull/24416
- Remove SupportsSaveMode: https://github.com/apache/spark/pull/24233

*Discussion*:
- Wenchen: When will we add select support?
  - John: working on resolution. DSv2 resolution is straightforward; the difficulty is ensuring a smooth transition from v1 to v2.
  - Ryan: table resolution will also be used for inserts. Once select is done, insert is next.
  - John: the PR may include insert as well
- Add default v2 catalog:
  - Ryan: A default catalog is needed for CTAS support when the source is v2
  - Ryan: A pass-through v2 catalog that uses SessionCatalog should be available as the default
- FunctionCatalog API:
  - Wenchen: this should have a design doc
  - Ryan: Agreed. The PR is for early discussion and prototyping.
- Bucketed joins: [Ed: I don’t remember much of this, feel free to expand on what was said]
  - Andrew: looks like there is lots of work to be done for bucketing. Sort removals aren’t done, and joining bucketed with non-bucketed tables still incurs hashing costs.
  - Ryan: work on support for Hive bucketing appears to have stopped, so it doesn’t look like this is an easy area to improve
  - Where should join optimization be done?
  - Andrew will create a prototype PR.
- Case sensitivity in catalogs: should catalogs report case sensitivity to Spark?
  - Ryan: catalogs connect to external systems, so Spark can’t impose case sensitivity requirements. A catalog is case sensitive or not, and could only be forced to violate Spark’s assumptions.
  - Ryan: requiring a catalog to report whether it is case sensitive doesn’t actually help Spark. If the catalog is case sensitive, then Spark should pass exactly what it received to avoid changing the meaning. If the catalog is case insensitive, then Spark can pass exactly what it received because case is handled in the catalog. Either way, Spark’s behavior doesn’t change.
  - Russell: not all catalogs are strictly case sensitive or case insensitive. Some are case insensitive unless an identifier is quoted; the quoted parts are case sensitive.
  - Ryan: So such a catalog would not be able to return true or false correctly.
  - Conclusion: Spark should pass identifiers exactly as it received them, without modification.

--
Ryan Blue
Software Engineer
Netflix
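To make the bucketed-join discussion concrete, here is a toy sketch in plain Java (an illustration only, not Spark's actual HashPartitioning or bucketing implementation) of why two tables bucketed identically on the join key can be joined bucket-by-bucket without a shuffle, while a non-bucketed side must first pay the hashing cost to be placed into matching buckets:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BucketedJoinSketch {
    // Rows with equal keys always hash to the same bucket number.
    static int bucketOf(String key, int numBuckets) {
        return Math.floorMod(key.hashCode(), numBuckets);
    }

    // The hashing step a non-bucketed table would have to pay at join time;
    // a pre-bucketed table has already done this at write time.
    static Map<Integer, List<String>> bucketRows(List<String> keys, int numBuckets) {
        Map<Integer, List<String>> buckets = new HashMap<>();
        for (String k : keys) {
            buckets.computeIfAbsent(bucketOf(k, numBuckets), b -> new ArrayList<>()).add(k);
        }
        return buckets;
    }

    // Join two identically bucketed tables one bucket at a time: matching keys
    // are guaranteed to be co-located, so no data movement is needed.
    static List<String> bucketedJoin(Map<Integer, List<String>> left,
                                     Map<Integer, List<String>> right) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : left.entrySet()) {
            List<String> rightRows = right.getOrDefault(e.getKey(), List.of());
            for (String key : e.getValue()) {
                if (rightRows.contains(key)) matches.add(key);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        int numBuckets = 4;
        Map<Integer, List<String>> left = bucketRows(List.of("a", "b"), numBuckets);
        Map<Integer, List<String>> right = bucketRows(List.of("a", "c"), numBuckets);
        System.out.println(bucketedJoin(left, right)); // [a]
    }
}
```

The sketch also shows why sort removal matters: if each bucket were additionally sorted on the key, the inner `contains` scan could become a merge, which is the further optimization that the notes say is not yet done.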
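The identifier pass-through conclusion can be sketched as follows (hypothetical lookup functions, not Spark's actual Identifier or TableCatalog API): whether a catalog is case sensitive or case insensitive, forwarding the identifier exactly as the user wrote it resolves correctly, while normalizing case on Spark's side breaks the case-sensitive catalog:

```java
import java.util.Map;
import java.util.Optional;

public class IdentifierPassThrough {
    // Case-sensitive catalog: exact-match lookup against stored names.
    static Optional<String> loadCaseSensitive(Map<String, String> tables, String ident) {
        return Optional.ofNullable(tables.get(ident));
    }

    // Case-insensitive catalog: normalizes internally, so any casing resolves.
    static Optional<String> loadCaseInsensitive(Map<String, String> tables, String ident) {
        return Optional.ofNullable(tables.get(ident.toLowerCase()));
    }

    public static void main(String[] args) {
        Map<String, String> cs = Map.of("Events", "v2-table"); // stores exact names
        Map<String, String> ci = Map.of("events", "v2-table"); // stores normalized names

        // Spark forwards exactly what the user typed: both catalogs resolve.
        System.out.println(loadCaseSensitive(cs, "Events").isPresent());   // true
        System.out.println(loadCaseInsensitive(ci, "EVENTS").isPresent()); // true

        // If Spark lower-cased identifiers first, the case-sensitive catalog
        // would fail to find the table, changing the meaning of the query.
        System.out.println(loadCaseSensitive(cs, "events").isPresent());   // false
    }
}
```

This is also why a reported true/false flag adds nothing: Spark's correct behavior (pass through unmodified) is the same in both branches, and as noted above, a quoting-sensitive catalog could not answer the question accurately anyway.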