Surfacing Iceberg in Spark SQL versions 2.x

Harish Butani Mon, 16 Sep 2019 12:17:04 -0700

Hi,

I have done some initial work on this: https://github.com/hbutani/icebergSQL
See the README <https://github.com/hbutani/icebergSQL/blob/master/README.md> and
the basic example
<https://github.com/hbutani/icebergSQL/blob/master/docs/basicExample.sql> for
details.


The goal is to provide the following for DataSource V1 tables:
- allow users to create managed tables and define source column to partition
  column transformations as table options.
- have SQL insert statements create new Iceberg Table snapshots.
- have SQL select statements leverage Iceberg Table snapshots for partition and
  file pruning.
- provide a new 'as of' clause for the SQL select statement to run a query
  against a particular snapshot of a managed table.
- extend Spark SQL with Iceberg management views and statements to view and
  manage the snapshots of a managed table.
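
To make the snapshot-based pruning and 'as of' ideas in the list above concrete,
here is a minimal conceptual sketch in plain Python. It is not taken from the
icebergSQL repo and does not use the Iceberg or Spark APIs: the names DataFile,
Snapshot, day_transform, prune_files and snapshot_as_of are hypothetical, and
the day-level transform is only an assumed example of a source column to
partition column transformation.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import List, Tuple


def day_transform(ts: datetime) -> date:
    """Hypothetical transform: map a source timestamp to a day partition value."""
    return ts.date()


@dataclass(frozen=True)
class DataFile:
    path: str
    partition_day: date  # partition value recorded for every row in this file


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    files: Tuple[DataFile, ...]  # complete set of data files visible in this snapshot


def snapshot_as_of(snapshots: List[Snapshot], snapshot_id: int) -> Snapshot:
    """Pick the snapshot a query should read, mimicking an 'as of' clause."""
    for snap in snapshots:
        if snap.snapshot_id == snapshot_id:
            return snap
    raise KeyError(f"unknown snapshot {snapshot_id}")


def prune_files(snapshot: Snapshot, lo: datetime, hi: datetime) -> List[str]:
    """Keep only files whose partition value can match lo <= ts <= hi."""
    lo_day, hi_day = day_transform(lo), day_transform(hi)
    return [f.path for f in snapshot.files if lo_day <= f.partition_day <= hi_day]


# Each insert produces a new snapshot that adds files to the previous one.
snap1 = Snapshot(1, (DataFile("f1.parquet", date(2019, 9, 14)),))
snap2 = Snapshot(2, snap1.files + (DataFile("f2.parquet", date(2019, 9, 16)),))
history = [snap1, snap2]
```

With this toy history, a query for 16 Sep 2019 against the latest snapshot only
needs to read f2.parquet, while the same query 'as of' snapshot 1 reads no files
at all, because f2.parquet did not exist yet.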

Reason for this:
Our experience is that a lot of deployments that use V1 datasource tables can
benefit from Iceberg. So we focus on Spark 2.x; the repo is at 2.4.4, but it is
easy to backport to 2.3.x and 2.2.x.
I see there is work going on to surface Iceberg Table Management as a V2
Datasource table <https://databricks.com/session/apache-spark-data-source-v2>,
but as far as I can tell, SQL integration for V2 Datasources is still in the
works.

Looking for feedback from the Iceberg community.

Regards,
Harish Butani.

(Please cc my email on any replies; I am not subscribed to the Iceberg dev list.)
