GitHub user rahil-c created a discussion: RFC-99: Hudi Type System
## Background

Wanted to start a discussion around @balaji-varadarajan-ai's proposal, RFC 99 (https://github.com/apache/hudi/pull/13743), which introduces a native type system within Apache Hudi, and around what an initial first step toward an MVP would look like.

For Hudi 1.2.0 we want to let users in the AI/ML space define some way of representing "BLOB-like" content (encompassing the binary content of an image, video, or audio file), as well as store the vector embeddings for these pieces of data in order to perform similarity search (for more details see RFC 102: https://github.com/apache/hudi/pull/14218). We will need to ensure our type system can account for those. We may also need to specify some granularity around how large this binary content can be, as well as what the dimensions are for vectors (@balaji-varadarajan-ai's RFC captures these details as well).

## What types to start with from RFC 99?

I think the following initial types should be the first step that we support, to cover both structured and unstructured use cases within a Hudi table. From @balaji-varadarajan-ai's RFC, I am thinking a first step would be supporting these types; if we feel we should add another type for the initial MVP, or cut this list down, feel free to leave a comment.

## Primitive Types

| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| BOOLEAN | A logical boolean value (true/false). | None |
| INTEGER | A 32-bit signed integer. | None |
| BIGINT | A 64-bit signed integer. | None |
| FLOAT16 | A 16-bit half-precision floating-point number. | None |
| FLOAT | A 32-bit single-precision floating-point number. | None |
| DOUBLE | A 64-bit double-precision floating-point number. | None |
| DECIMAL(p, s) | An exact numeric with specified precision/scale. | p, s |
| STRING | A variable-length UTF-8 character string, limited to 2GB per value. | None |
| LARGE\_STRING | A variable-length UTF-8 character string for values exceeding 2GB. | None |
| FIXED(n) | A fixed-length sequence of n bytes. | n |

## AI/ML types

| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| VECTOR(element\_type, dimension) | A dense, fixed-length vector of numeric values. | Element type, dimension |
| BINARY | A variable-length sequence of bytes, limited to 2GB per value. | None |
| LARGE\_BINARY | A variable-length sequence of bytes for values exceeding 2GB. | None |

## Nested Types

| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| STRUCT\<name: type, ...\> | An ordered collection of named fields. | Field list |
| LIST\<element\_type\> | An ordered list of elements of the same type. | Element type |
| MAP\<key\_type, value\_type\> | A collection of key-value pairs. Keys must be unique. | Key, Value types |

## Temporal types

| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| DATE | A calendar date (year, month, day). | None |
| DATE64 | A calendar date stored as milliseconds. | None |
| TIME(precision) | A time of day without a timezone. | s, ms, us, ns |
| TIMESTAMP(precision) | An instant in time without a timezone. | us or ns |
| TIMESTAMPTZ(precision) | An instant in time with a timezone, normalized and stored as UTC. | us or ns |
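To make the parameterized entries above concrete, here is a rough sketch of how this list could be modeled as plain Java constructs in the spirit of Iceberg's `Types.java`. This is purely hypothetical: none of these classes exist in Hudi today, the names are invented, and it uses Java 16+ records for brevity.

```java
// Hypothetical sketch only; illustrates how parameterized types from the
// tables above (DECIMAL(p, s), FIXED(n), VECTOR(element_type, dimension))
// might be modeled as Java constructs, similar to Iceberg's Types.java.
public final class HudiTypesSketch {

  // Base marker for all logical types.
  interface Type {
    String typeName();
  }

  // Parameter-free primitives can be shared singletons.
  enum Primitive implements Type {
    BOOLEAN, INTEGER, BIGINT, FLOAT16, FLOAT, DOUBLE,
    STRING, LARGE_STRING, BINARY, LARGE_BINARY,
    DATE, DATE64;

    @Override public String typeName() { return name(); }
  }

  // DECIMAL(p, s): exact numeric with precision and scale.
  record Decimal(int precision, int scale) implements Type {
    @Override public String typeName() {
      return "DECIMAL(" + precision + ", " + scale + ")";
    }
  }

  // FIXED(n): fixed-length sequence of n bytes.
  record Fixed(int length) implements Type {
    @Override public String typeName() { return "FIXED(" + length + ")"; }
  }

  // VECTOR(element_type, dimension): dense fixed-length numeric vector,
  // the key addition for the AI/ML similarity-search use case (RFC 102).
  record Vector(Type elementType, int dimension) implements Type {
    @Override public String typeName() {
      return "VECTOR(" + elementType.typeName() + ", " + dimension + ")";
    }
  }

  public static void main(String[] args) {
    // e.g. a 768-dimensional float embedding column next to raw media bytes.
    Type embedding = new Vector(Primitive.FLOAT, 768);
    Type rawImage = Primitive.LARGE_BINARY;
    System.out.println(embedding.typeName() + " / " + rawImage.typeName());
  }
}
```

The point is just that DECIMAL, FIXED, and VECTOR need to carry parameters, while most primitives can be singletons.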
## What should the type system be backed by?

#### Option 1

Looking at other table-format projects in the space, some, such as Apache Iceberg, take the approach of defining a native type system (https://iceberg.apache.org/spec/#primitive-types) that is not backed by any particular file format (Parquet, Avro, Arrow); see https://iceberg.apache.org/spec/?h=parquet+types#parquet. One option is to follow a similar approach: define our own type system as Java constructs, as Iceberg does in https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Types.java, and then have each engine and file format convert between its representation and ours. See, for example, https://github.com/apache/iceberg/blob/main/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java#L68.

#### Option 2

Similar to how we use Avro today within the Hudi project for schema representation, we could instead leverage Apache Arrow and use Arrow types as first-class citizens within the project. For example, Lance currently does not have a type system and instead leverages Arrow directly:

> **Corollary 1: There is No Type System** (https://lancedb.com/blog/lance-v2/#corollary-1-there-is-no-type-system)
>
> The Lance format itself does not have a type system. From Lance’s perspective, every column is simply a collection of pages (with an encoding) and each page a collection of buffers. Of course, the Lance readers and writers will actually need to convert between these pages into typed arrays of some kind. The readers and writers we have written use the [Arrow type system](https://arrow.apache.org/docs/format/Columnar.html).
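To illustrate the shape of Option 1: because the native types would not be tied to any file format, every engine or format integration needs its own conversion layer, analogous to Iceberg's `ArrowReader` linked above. A hedged sketch, reusing the hypothetical `HudiTypesSketch` classes from the earlier sketch together with Arrow's real `org.apache.arrow.vector.types.pojo.ArrowType` classes:

```java
import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;

// Hypothetical converter from the sketched native types to Arrow types.
// Under Option 1, every format/engine integration needs a mapping like this.
final class NativeToArrow {
  static ArrowType convert(HudiTypesSketch.Type type) {
    if (type == HudiTypesSketch.Primitive.INTEGER) {
      return new ArrowType.Int(32, /*signed=*/ true);
    }
    if (type == HudiTypesSketch.Primitive.BIGINT) {
      return new ArrowType.Int(64, true);
    }
    if (type == HudiTypesSketch.Primitive.FLOAT) {
      return new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE);
    }
    if (type == HudiTypesSketch.Primitive.STRING) {
      return new ArrowType.Utf8();
    }
    if (type instanceof HudiTypesSketch.Decimal d) {
      return new ArrowType.Decimal(d.precision(), d.scale(), 128);
    }
    if (type instanceof HudiTypesSketch.Vector v) {
      // A dense vector maps naturally onto Arrow's FixedSizeList.
      return new ArrowType.FixedSizeList(v.dimension());
    }
    throw new UnsupportedOperationException(
        "No Arrow mapping for " + type.typeName());
  }
}
```

Analogous mappings would be needed for Parquet and Avro; that conversion burden is exactly what Option 2 avoids.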
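And to illustrate Option 2: if Arrow types are first-class, a table schema for the AI/ML use case from the background section could be described directly with Arrow's Java classes, with no Hudi-owned type layer in between. A minimal sketch (the column names and the 768 dimension are invented for illustration):

```java
import java.util.List;
import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

// Sketch of Option 2: no Hudi-owned type layer; the table schema is
// expressed directly with Arrow's type system, as Lance does.
final class ArrowBackedSchemaExample {
  public static void main(String[] args) {
    // Raw media bytes: Arrow's LargeBinary uses 64-bit offsets,
    // covering values beyond the 2GB limit of plain Binary.
    Field rawImage = Field.nullable("raw_image", new ArrowType.LargeBinary());

    // A 768-dimensional float32 embedding as a FixedSizeList<float>.
    Field element = Field.nullable("element",
        new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE));
    Field embedding = new Field("embedding",
        FieldType.nullable(new ArrowType.FixedSizeList(768)),
        List.of(element));

    Schema schema = new Schema(List.of(
        Field.nullable("id", new ArrowType.Int(64, true)),
        rawImage,
        embedding));
    System.out.println(schema);
  }
}
```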
GitHub link: https://github.com/apache/hudi/discussions/14253