Thank you for linking that, Dongjoon! I found SPARK-44518<https://issues.apache.org/jira/browse/SPARK-44518> in that list, which proposes turning Spark’s Hive integration into a data source. IIUC, that’s closely related, but I’m curious whether I’m thinking about this correctly:
The big gaps between the built-in v1 and v2 data sources are support for bucketing and partitioning. The reason the v1 data sources support those is that the v1 paths are somewhat interleaved with Spark’s Hive integration. As I understand it, separating out that Hive integration, or making it more data-source-like, would put us closer to supporting bucketing and partitioning in v2, and then to defaulting to v2.

From: Dongjoon Hyun <dongjoon.h...@gmail.com>
Date: Friday, 15 September 2023 at 05:36
To: Will Raschkowski <wraschkow...@palantir.com.invalid>
Cc: dev@spark.apache.org <dev@spark.apache.org>
Subject: Re: Plans for built-in v2 data sources in Spark 4

Hi, Will.

According to the following JIRA, as of now, there is no plan or on-going discussion to switch it.

https://issues.apache.org/jira/browse/SPARK-44111 (Prepare Apache Spark 4.0.0)

Thanks,
Dongjoon.

On Wed, Sep 13, 2023 at 9:02 AM Will Raschkowski <wraschkow...@palantir.com.invalid> wrote:

Hey everyone,

I was wondering what the plans are for Spark's built-in v2 file data sources in Spark 4.

Concretely, is the plan for Spark 4 to continue defaulting to the built-in v1 data sources? And if yes, what are the blockers for defaulting to v2? I see, just as an example, that writing Hive partitions is not supported in v2. Are there other blockers or outstanding discussions?

Regards,
Will