[GitHub] [iceberg] jackye1995 opened a new pull request #4401: Spark: add procedure to generate symlink manifests

GitBox Fri, 25 Mar 2022 00:41:00 -0700


jackye1995 opened a new pull request #4401:
URL: https://github.com/apache/iceberg/pull/4401



   Add a Spark procedure to generate symlink manifests, so that systems without 
Iceberg support can read Iceberg table data using an external table:
   
   ```sql
   CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
   [PARTITIONED BY (col_name2 col_datatype2, ...)]
   ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
   STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
   OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
   LOCATION '<symlink-table-root-path>' 
   ```
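
   For context, Hive's `SymlinkTextInputFormat` reads plain text files in which 
each line is the path of an actual data file. The sketch below shows that idea 
only; it is not the code in this PR. The function name 
`write_symlink_manifests`, the file name `manifest`, and the 
one-directory-per-partition layout are assumptions made for illustration.

```python
import os
from collections import defaultdict


def write_symlink_manifests(root, data_files):
    """Write one text manifest per partition, one data file path per line.

    data_files is an iterable of (partition_path, data_file_uri) pairs;
    partition_path is '' for an unpartitioned table. The layout and the
    file name 'manifest' are hypothetical, not taken from the PR.
    """
    grouped = defaultdict(list)
    for partition_path, uri in data_files:
        grouped[partition_path].append(uri)

    written = {}
    for partition_path, uris in grouped.items():
        manifest_dir = os.path.join(root, partition_path) if partition_path else root
        os.makedirs(manifest_dir, exist_ok=True)
        manifest_path = os.path.join(manifest_dir, "manifest")
        with open(manifest_path, "w") as f:
            # Sorting is only for deterministic output in this sketch.
            f.write("\n".join(sorted(uris)) + "\n")
        written[partition_path] = manifest_path
    return written
```

   An external table declared as above would then point its `LOCATION` at 
`root`, and the engine would resolve each listed path to a Parquet file.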
   
   I did not add an action for this because the goal is only to give users of 
an existing query engine that does not natively support Iceberg (in my case, 
Redshift Spectrum) a gateway to start using Iceberg, since most engines support 
Hive with the symlink input format to some extent. If we think it deserves an 
action in the core API, I can add that as well.
   
   The procedure looks like:
   
   ```sql
   CALL catalog.system.generate_symlink_format_manifest(
     table => 'table_name', 
     symlink_root_location => 's3://some/path'
   );
   ```
   
   The `symlink_root_location` argument is optional and defaults to 
`<table_root>/_symlink_format_manifest/<snapshot_id>`. The snapshot ID suffix 
keeps the output of separate runs apart: if the procedure is executed twice 
against the same table and the table was updated in between, the results are 
not mixed. Users who want a consistent root path for the symlink table can 
pass one as an override.
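
   The defaulting rule described above can be sketched as follows. This is a 
minimal illustration, not the procedure's implementation; the function name 
`resolve_symlink_root` and its parameters are hypothetical.

```python
from typing import Optional


def resolve_symlink_root(table_root: str, snapshot_id: int,
                         override: Optional[str] = None) -> str:
    """Return the symlink root: the override if given, else the default.

    The default appends the snapshot ID so that runs against different
    snapshots of the same table land in different directories.
    """
    if override:
        return override
    return f"{table_root.rstrip('/')}/_symlink_format_manifest/{snapshot_id}"
```

   With a default path, two runs against different snapshots produce two 
separate directories, while an override always resolves to the same location.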
   
   I considered adding a `snapshot_id` input option so that we could generate 
a symlink table for any historical snapshot, but decided not to, to avoid 
making the procedure too complicated. We can add it as a follow-up if needed.
   
   The procedure currently returns the `snapshot_id` that it was executed 
against and `data_file_count`, the number of data files in the symlink 
manifests.





