Here are my notes for the latest DSv2 community sync. As usual, if you have comments or corrections, please reply. If you’d like to be invited to the next sync, email me directly. Everyone is welcome to attend.
*Attendees*:
Ryan Blue
John Zhuge
Andrew Long
Bruce Robbins
Dilip Biswal
Gengliang Wang
Kevin Yu
Michael Artz
Russel Spitzer
Yifei Huang
Zhilmil Dhillon

*Topics*:
- Introductions
- Open pull requests
- V2 organization
- Bucketing and sort order from v2 sources

*Discussion*:
- Introductions: we stopped doing introductions when the group grew large. Attendance has gone down since the first few syncs, so we decided to resume them.
- V2 organization / PR #24416: https://github.com/apache/spark/pull/24416
  - Ryan: There’s an open PR to move v2 into catalyst.
  - Andrew: What is the distinction between catalyst and sql? How do we know what goes where?
  - Ryan: IIUC, the catalyst module is supposed to be a stand-alone query planner that doesn’t get into concrete physical plans. The catalyst package is the private implementation. Anything that is generic catalyst, including APIs like DataType, should be in the catalyst module. Anything public, like an API, should not be in the catalyst package.
  - Ryan: v2 meets those requirements now, and I don’t have a strong opinion on organization. We just need to choose one.
  - No one had a strong opinion, so we tabled this. In #24416 or shortly after, let’s decide on an organization and do the move all at once.
  - Next steps: someone with an opinion on organization should suggest a structure.
- TableCatalog API / PR #24246: https://github.com/apache/spark/pull/24246
  - Ryan: Wenchen’s last comment was that he was waiting for tests to pass, and they are now passing. Maybe this will be merged soon?
- Bucketing and sort order from v2 sources
  - Andrew: interested in using existing data layout and sort order to remove expensive tasks in joins.
  - Ryan: in v2, bucketing is unified with other partitioning functions. I plan to build a way for Spark to get partition function implementations from a source, so it can use that function to prepare the other side of a join.
    From there, I have been thinking about a way to check compatibility between functions, so we could validate that table A uses the same bucketing as table B.
  - Dilip: is bucketing Hive-specific?
  - Russel: Cassandra also buckets.
  - Matt: what is the difference between bucketing and other partition functions for this?
  - Ryan: probably no difference. If you’re partitioning by hour, you could probably use that, too.
  - Dilip: how can Spark compare functions?
  - Andrew: should we introduce a standard bucketing function? It would be easy to switch to one for their use case.
  - Ryan: it is difficult to introduce a standard because so much data already exists in tables. I think it is easier to support multiple functions.
  - Russel: Cassandra uses dynamic bucketing, which wouldn’t be able to use a standard.
  - Dilip: sources could push down joins.
  - Russel: that’s a harder problem.
  - Andrew: does anyone else limit bucket size?
  - Ryan: we don’t, because we assume the sort can spill. Probably a good optimization for later.
  - Matt: what are the follow-up items for this?
  - Andrew: will look into the current state of bucketing in Spark.
  - Ryan: it would be great if someone thought about what the FunctionCatalog interface will look like.

--
Ryan Blue
Software Engineer
Netflix
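P.S. To seed the FunctionCatalog discussion: below is a rough strawman in Scala. None of these names (FunctionCatalog, PartitionFunction, loadPartitionFunction, HashBucket) exist in Spark today; this is only a sketch of how a source might expose a bucketing function, and how Spark could check that two tables are bucketed compatibly before eliding a shuffle.

```scala
// Strawman only: hypothetical interfaces, not part of any Spark API.

// A named partitioning function a source can hand to Spark, so Spark can
// apply the same function to the other side of a join.
trait PartitionFunction[T] {
  def name: String
  def numPartitions: Int
  def apply(value: T): Int

  // Two functions are compatible when they route equal values to the same
  // partition; as a first cut, require the same name and partition count.
  def isCompatible(other: PartitionFunction[_]): Boolean =
    name == other.name && numPartitions == other.numPartitions
}

// A catalog Spark could query to load a source's partition function by name.
trait FunctionCatalog {
  def loadPartitionFunction(name: String, numPartitions: Int): PartitionFunction[Any]
}

// Example implementation: simple hash bucketing over any value.
case class HashBucket(numPartitions: Int) extends PartitionFunction[Any] {
  override def name: String = "hash-bucket"
  override def apply(value: Any): Int =
    Math.floorMod(value.hashCode, numPartitions) // floorMod keeps the result non-negative
}
```

With something like this, validating that table A and table B share bucketing reduces to `a.isCompatible(b)`; a real design would also need to cover multi-column transforms and versioned hash implementations.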