jackye1995 opened a new pull request #4401:
URL: https://github.com/apache/iceberg/pull/4401
Add a Spark procedure to generate symlink manifests, so that systems without
Iceberg support can read Iceberg table data using an external table:
```sql
CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
[PARTITIONED BY (col_name2 col_datatype2, ...)]
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<symlink-table-root-path>'
```
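As a concrete illustration of the template above, here is a minimal sketch for an unpartitioned table. The table name `events_symlink` and its two columns are hypothetical, the data files are assumed to be Parquet, and `s3://some/path` stands in for whatever symlink root the procedure wrote to:

```sql
-- Hypothetical example: table name and columns are placeholders,
-- not taken from the pull request.
CREATE EXTERNAL TABLE events_symlink (
  event_id BIGINT,
  payload  STRING
)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
-- Assumed to be the symlink root location produced by the procedure.
LOCATION 's3://some/path';
```

The column list must match the schema of the Iceberg table's data files; the exact DDL accepted may differ slightly between engines.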
I did not add an action for this because the procedure is meant only as a
gateway for users whose existing query engine does not natively support
Iceberg (in my case, Redshift Spectrum) to start using Iceberg; most engines
support Hive tables with the symlink input format to some extent. If we think
it deserves an action in the core API, I can add that as well.
The procedure looks like:
```sql
CALL catalog.system.generate_symlink_format_manifest(
table => 'table_name',
symlink_root_location => 's3://some/path'
);
```
The `symlink_root_location` argument is optional. The default is
`<table_root>/_symlink_format_manifest/<snapshot_id>`. The snapshot ID suffix
keeps the output of separate runs apart: if the table is updated between two
executions of the procedure, the results are not mixed. Users who want a
consistent root path for the symlink table can pass it in as an override.
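For example, a call that relies on the default location simply omits the optional argument. This is a sketch based on the signature described in this PR; `catalog` and `table_name` are placeholders:

```sql
-- Writes the symlink manifests under
-- <table_root>/_symlink_format_manifest/<snapshot_id>
CALL catalog.system.generate_symlink_format_manifest(
  table => 'table_name'
);
```

Because the default path includes the snapshot ID, an external table defined on it keeps pointing at the snapshot that was current when the procedure ran.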
I considered adding a `snapshot_id` input option so that a symlink table
could be generated for any historical snapshot, but decided against it to
avoid making the procedure too complicated. We can add it as a follow-up if
needed.
The procedure currently returns the `snapshot_id` that it was executed
against and `data_file_count`, the number of data files in the symlink
manifests.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]