GitHub user xushiyan created a discussion: Hudi SQL standards

SQL is a primary way for people to interact with Hudi tables, therefore, a set 
of standard,  commonly agreed semantics and behaviors should strictly followed 
and prominently listed out for everyone, both devs and users. I'm starting this 
discussion to collect points about improving the semantics and behaviors of 
using SQL in Hudi. I've noted down a few points to start with. With enough 
points, we should start an RFC for SQL standards, and then publish it on 
website as part of the tech spec.

## Timestamp format validation

Time travel and incremental queries should conform to the same set of timestamp 
formats and perform strict validation before the read operation proceeds. 
Currently, timestamp strings are only loosely validated against 
`yyyyMMddHHmmssSSS`. A more comprehensive validation should be like:
- Hudi Timeline format (highest matching precedence): yyyyMMddHHmmssSSS or 
yyyyMMddHHmmss.
- Unix epoch time in seconds, milliseconds, microseconds, or nanoseconds.
- ISO 8601 format including but not limited to:
  - yyyy-MM-dd'T'HH:mm:ss.SSS+00:00
  - yyyy-MM-dd'T'HH:mm:ss.SSSZ
  - yyyy-MM-dd'T'HH:mm:ss.SSS
  - yyyy-MM-dd'T'HH:mm:ss+00:00
  - yyyy-MM-dd'T'HH:mm:ssZ
  - yyyy-MM-dd'T'HH:mm:ss
  - yyyy-MM-dd

## Insert overwrite (table) behavior needs simplification

Currently, using INSERT OVERWRITE (TABLE) to achieve insert overwrite 
partitions and the table follow this logic:

If `hoodie.datasource.write.operation` = `insert_overwrite`, INSERT OVERWRITE 
(TABLE) always overwrites relevant partitions, either dynamically based on the 
data, or statically based on the PARTITION() clause.
If `hoodie.datasource.write.operation` = `insert_overwrite_table`, INSERT 
OVERWRITE (TABLE) always overwrites the whole table, regardless of having 
PARTITION() clause or not.

If `hoodie.datasource.write.operation` is not set, it looks for 
`hoodie.datasource.overwrite.mode`:
  - If `hoodie.datasource.overwrite.mode` = DYNAMIC, INSERT OVERWRITE (TABLE) 
always overwrites relevant partitions, either dynamically based on the data, or 
statically based on the PARTITION() clause.
  - If `hoodie.datasource.overwrite.mode` = STATIC, 
    - INSERT OVERWRITE (TABLE) PARTITION() will overwrite the specified 
partition
    - INSERT OVERWRITE (TABLE) (no static PARTITION clause) will always 
overwrite the whole table

This involves too many configs and cases, we should simplify or re-define the 
behaviors.

## Inferring merge modes and INSERT INTO write operation

- When a table is created with no `preCombineField` (ordering field), the 
`INSERT INTO` implicitly works as insert operation, and merge mode is commit 
time ordering.
- When a table is created with `preCombineField` (ordering field), the `INSERT 
INTO` implicitly works as upsert operation, and merge mode is event time 
ordering.

The inferring should be prominently listed out as part of the SQL standards.

## index management

While we expand the index support in metadata table, we need to set a clear 
guidance for users to manage indexes. For indexes depending on others, 
create/drop indexes should be carefully guarded. For example, when record index 
and secondary indexes have been created, users cannot drop record index before 
dropping all secondary indexes.

Index names: current `record_index` name is reserved for creating record index, 
which is not obvious to users. We should allow custom names for indexes, while 
behind the scene, each created index has another property indicating its type. 
So no reserved name, just fail when repeated name is given by user when 
creating an index.


What are the SQL commands that you think need clearer definition or 
simplification? Please add your thoughts too.

GitHub link: https://github.com/apache/hudi/discussions/13815

----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]

Reply via email to