Hi Jacky,

We’ve internally released support for Hive tables (and Spark FileFormat tables) using DataSourceV2 so that we can switch between catalogs; it sounds like that’s what you are planning to build as well. It would be great to work with the broader community on a Hive connector.
I will get a branch of our connectors published so that you can take a look. I think it should be fairly close to what you’re talking about building, with a few exceptions:
- Our implementation always uses our S3 committers, but it should be easy to change this
- It supports per-partition formats, like Hive

Do you have an idea about where the connector should be developed? I don’t think it makes sense for it to be part of Spark: that would keep the complexity in the main project and force Hive versions to be updated slowly. Using a separate project would mean less code in Spark specific to one source, and it could more easily support multiple Hive versions. Maybe we should create a project for catalog plug-ins?

rb

On Mon, Mar 23, 2020 at 4:20 AM JackyLee <qcsd2...@163.com> wrote:
> Hi devs,
> I’d like to start a discussion about supporting Hive on DataSourceV2. We’re
> now working on a project that uses DataSourceV2 to provide multiple-source
> support, and it works very well with our data lake solution, but it does
> not yet support HiveTable.
>
> There are three reasons why we need to support Hive on DataSourceV2:
> 1. Hive itself is one of Spark’s data sources.
> 2. A HiveTable is essentially a FileTable with its own input and output
> formats, so it works fine with FileTable.
> 3. HiveTable should be stateless, so users can freely read or write Hive
> using batch or microbatch.
>
> We implemented stateless Hive on DataSourceV1; it lets users write to Hive
> in streaming or batch mode, and it is widely used in our company. Recently,
> we have been working to support Hive on DataSourceV2, and multiple Hive
> catalogs and DDL commands are already supported.
>
> Looking forward to more discussions on this.

--
Ryan Blue
Software Engineer
Netflix
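[Editorial note for readers following the thread: the catalog switching discussed above relies on Spark 3's DataSourceV2 catalog plug-in mechanism, where a catalog is registered under a `spark.sql.catalog.<name>` key pointing at a class implementing `org.apache.spark.sql.connector.catalog.CatalogPlugin`, and any further keys under that prefix are passed to the plug-in's `initialize()` method. A minimal configuration sketch follows; the `com.example.hive.HiveTableCatalog` class name and the `metastore.uris` option key are hypothetical placeholders, not the connector Ryan describes.]

```
# spark-defaults.conf -- registering a DataSourceV2 catalog plug-in.
# The class name and option key below are hypothetical; any
# implementation of CatalogPlugin can be registered this way.
spark.sql.catalog.my_hive=com.example.hive.HiveTableCatalog
spark.sql.catalog.my_hive.metastore.uris=thrift://metastore:9083
```

Tables in the registered catalog are then addressed with three-part names in SQL, e.g. `SELECT * FROM my_hive.db.tbl`, which is how a session can switch between a Hive catalog and Spark's built-in one.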