Thank you for linking that, Dongjoon!

I found SPARK-44518 (https://issues.apache.org/jira/browse/SPARK-44518) in that 
list, which proposes turning Spark’s Hive integration into a data source. IIUC, 
that’s closely related, but I’m curious whether I’m thinking about this correctly:

The big gaps between the built-in v1 and v2 data sources are support for 
bucketing and partitioning. The v1 data sources support those because the v1 
paths are interleaved with Spark’s Hive integration. As I understand it, 
separating out that Hive integration, or making it more data-source-like, would 
bring us closer to supporting bucketing and partitioning in v2, and then to 
defaulting to v2.
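
For anyone following along, here is a minimal sketch of how the v1/v2 choice 
surfaces today. It assumes the internal setting 
spark.sql.sources.useV1SourceList works the way I remember it (a 
comma-separated list of sources pinned to the v1 path). The value below is my 
recollection of the default with parquet removed, not something I have verified 
against a particular release:

```
# spark-defaults.conf (sketch, unverified assumption about the default list)
# Sources named here stay on the v1 path; parquet is left out so that
# Parquet would go through the built-in v2 source instead.
spark.sql.sources.useV1SourceList  avro,csv,json,kafka,orc,text
```

As far as I can tell, flipping this does not by itself close the bucketing and 
partitioning gaps described above.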

From: Dongjoon Hyun <dongjoon.h...@gmail.com>
Date: Friday, 15 September 2023 at 05:36
To: Will Raschkowski <wraschkow...@palantir.com.invalid>
Cc: dev@spark.apache.org <dev@spark.apache.org>
Subject: Re: Plans for built-in v2 data sources in Spark 4

Hi, Will.

According to the following JIRA, as of now, there is no plan or ongoing 
discussion to switch the default.

https://issues.apache.org/jira/browse/SPARK-44111 (Prepare Apache Spark 4.0.0)

Thanks,
Dongjoon.


On Wed, Sep 13, 2023 at 9:02 AM Will Raschkowski 
<wraschkow...@palantir.com.invalid> wrote:
Hey everyone,

I was wondering what the plans are for Spark's built-in v2 file data sources in 
Spark 4.

Concretely, is the plan for Spark 4 to continue defaulting to the built-in v1 
data sources? And if so, what are the blockers for defaulting to v2? I see, 
just as an example, that writing Hive partitions is not supported in v2. Are 
there other blockers or outstanding discussions?

Regards,
Will
