GitHub user xushiyan created a discussion: Hudi SQL standards
SQL is a primary way for people to interact with Hudi tables, therefore, a set
of standard, commonly agreed semantics and behaviors should strictly followed
and prominently listed out for everyone, both devs and users. I'm starting this
discussion to collect points about improving the semantics and behaviors of
using SQL in Hudi. I've noted down a few points to start with. With enough
points, we should start an RFC for SQL standards, and then publish it on
website as part of the tech spec.
## Timestamp format validation
Time travel and incremental queries should conform to the same set of timestamp
formats and perform strict validation before the read operation proceeds.
Currently, timestamp strings are only loosely validated against
`yyyyMMddHHmmssSSS`. A more comprehensive validation should be like:
- Hudi Timeline format (highest matching precedence): yyyyMMddHHmmssSSS or
yyyyMMddHHmmss.
- Unix epoch time in seconds, milliseconds, microseconds, or nanoseconds.
- ISO 8601 format including but not limited to:
- yyyy-MM-dd'T'HH:mm:ss.SSS+00:00
- yyyy-MM-dd'T'HH:mm:ss.SSSZ
- yyyy-MM-dd'T'HH:mm:ss.SSS
- yyyy-MM-dd'T'HH:mm:ss+00:00
- yyyy-MM-dd'T'HH:mm:ssZ
- yyyy-MM-dd'T'HH:mm:ss
- yyyy-MM-dd
## Insert overwrite (table) behavior needs simplification
Currently, using INSERT OVERWRITE (TABLE) to achieve insert overwrite
partitions and the table follow this logic:
If `hoodie.datasource.write.operation` = `insert_overwrite`, INSERT OVERWRITE
(TABLE) always overwrites relevant partitions, either dynamically based on the
data, or statically based on the PARTITION() clause.
If `hoodie.datasource.write.operation` = `insert_overwrite_table`, INSERT
OVERWRITE (TABLE) always overwrites the whole table, regardless of having
PARTITION() clause or not.
If `hoodie.datasource.write.operation` is not set, it looks for
`hoodie.datasource.overwrite.mode`:
- If `hoodie.datasource.overwrite.mode` = DYNAMIC, INSERT OVERWRITE (TABLE)
always overwrites relevant partitions, either dynamically based on the data, or
statically based on the PARTITION() clause.
- If `hoodie.datasource.overwrite.mode` = STATIC,
- INSERT OVERWRITE (TABLE) PARTITION() will overwrite the specified
partition
- INSERT OVERWRITE (TABLE) (no static PARTITION clause) will always
overwrite the whole table
This involves too many configs and cases, we should simplify or re-define the
behaviors.
## Inferring merge modes and INSERT INTO write operation
- When a table is created with no `preCombineField` (ordering field), the
`INSERT INTO` implicitly works as insert operation, and merge mode is commit
time ordering.
- When a table is created with `preCombineField` (ordering field), the `INSERT
INTO` implicitly works as upsert operation, and merge mode is event time
ordering.
The inferring should be prominently listed out as part of the SQL standards.
## index management
While we expand the index support in metadata table, we need to set a clear
guidance for users to manage indexes. For indexes depending on others,
create/drop indexes should be carefully guarded. For example, when record index
and secondary indexes have been created, users cannot drop record index before
dropping all secondary indexes.
Index names: current `record_index` name is reserved for creating record index,
which is not obvious to users. We should allow custom names for indexes, while
behind the scene, each created index has another property indicating its type.
So no reserved name, just fail when repeated name is given by user when
creating an index.
What are the SQL commands that you think need clearer definition or
simplification? Please add your thoughts too.
GitHub link: https://github.com/apache/hudi/discussions/13815
----
This is an automatically sent email for [email protected].
To unsubscribe, please send an email to: [email protected]