GitHub user rahil-c created a discussion: RFC-99: Hudi Type System

## Background
I wanted to start a discussion around @balaji-varadarajan-ai's proposal, RFC 99 (https://github.com/apache/hudi/pull/13743), which introduces a `native type system` within Apache Hudi, and around what a reasonable first step toward an MVP would look like.

For Hudi 1.2.0 we want to let users in the AI/ML space define some way of representing "BLOB-like" content (encompassing the binary content of an image, video, or audio file), as well as store the vector embeddings for these pieces of data in order to perform similarity search (for more details see RFC 102: https://github.com/apache/hudi/pull/14218). We will need to ensure our type system can account for both. We may also need some granularity around how large this binary content can be, and, for vectors, what their dimensions are (@balaji-varadarajan-ai's RFC captured these details as well). For example, a multimodal table might pair a LARGE\_BINARY column holding the raw media bytes with a VECTOR(FLOAT, 768) embedding column.

## What types to start with from RFC 99?
I think the following initial types are the right first step, covering both structured and unstructured use cases within a Hudi table. From @balaji-varadarajan-ai's RFC, I am thinking a first step would be supporting the types below; if you feel we should add another type for the initial MVP, or cut this list down, please leave a comment.


## Primitive Types
| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| BOOLEAN | A logical boolean value (true/false). | None |
| INTEGER | A 32-bit signed integer. | None |
| BIGINT | A 64-bit signed integer. | None |
| FLOAT16 | A 16-bit half-precision floating-point number. | None |
| FLOAT | A 32-bit single-precision floating-point number. | None |
| DOUBLE | A 64-bit double-precision floating-point number. | None |
| DECIMAL(p, s) | An exact numeric with specified precision/scale. | p, s |
| STRING | A variable-length UTF-8 character string, limited to 2GB per value. | None |
| LARGE\_STRING | A variable-length UTF-8 character string for values exceeding 2GB. | None |
| FIXED(n) | A fixed-length sequence of n bytes. | n |

## AI/ML types
| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| VECTOR(element\_type, dimension) | A dense, fixed-length vector of numeric values. | Element type, dimension |
| BINARY | A variable-length sequence of bytes, limited to 2GB per value. | None |
| LARGE\_BINARY | A variable-length sequence of bytes for values exceeding 2GB. | None |

## Nested Types
| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| STRUCT\<name: type, ...\> | An ordered collection of named fields. | Field list |
| LIST\<element\_type\> | An ordered list of elements of the same type. | Element type |
| MAP\<key\_type, value\_type\> | A collection of key-value pairs. Keys must be unique. | Key, value types |

## Temporal types
| Logical Type | Description | Parameters |
| :---- | :---- | :---- |
| DATE | A calendar date (year, month, day). | None |
| DATE64 | A calendar date stored as milliseconds since the Unix epoch. | None |
| TIME(precision) | A time of day without a timezone. | s, ms, us, ns |
| TIMESTAMP(precision) | An instant in time without a timezone. | us or ns |
| TIMESTAMPTZ(precision) | An instant in time with a timezone, normalized and stored as UTC. | us or ns |
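
Whichever backing representation we pick (see the options below), the parameterized entries above need to carry their parameters around. As a minimal sketch, assuming a hypothetical Java value class (none of these names exist in Hudi today), a VECTOR(element\_type, dimension) could look like:

```java
// Hypothetical sketch only: how a parameterized logical type such as
// VECTOR(element_type, dimension) could carry its parameters as a plain
// Java value class. DECIMAL(p, s), FIXED(n), and TIMESTAMP(precision)
// would follow the same pattern.
public final class VectorType {
  public enum ElementType { FLOAT16, FLOAT, DOUBLE }

  private final ElementType elementType;
  private final int dimension;

  public VectorType(ElementType elementType, int dimension) {
    if (dimension <= 0) {
      throw new IllegalArgumentException("dimension must be positive");
    }
    this.elementType = elementType;
    this.dimension = dimension;
  }

  public ElementType elementType() { return elementType; }
  public int dimension() { return dimension; }

  @Override
  public String toString() {
    // e.g. new VectorType(ElementType.FLOAT, 768) -> "VECTOR(FLOAT, 768)"
    return String.format("VECTOR(%s, %d)", elementType, dimension);
  }
}
```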



## What should the type system be backed by?

#### Option 1
Looking at other table format projects in the space, some, such as Apache Iceberg, define a native type system (https://iceberg.apache.org/spec/#primitive-types) that is not backed by any particular file format (Parquet, Avro, Arrow); mappings to each file format are then specified separately, e.g. https://iceberg.apache.org/spec/?h=parquet+types#parquet.
One option is to follow a similar approach: define our own type system as Java constructs (see Iceberg's https://github.com/apache/iceberg/blob/main/api/src/main/java/org/apache/iceberg/types/Types.java) and then have the different engines and file formats convert between our representation and theirs. See for example https://github.com/apache/iceberg/blob/main/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/ArrowReader.java#L68
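
To make the conversion burden of Option 1 concrete: with a native type system, every engine and file format integration needs a mapping layer like the one below. This is a minimal sketch assuming a hypothetical Hudi-side `LogicalType` enum; the Arrow classes are real, but everything Hudi-side is made up for illustration.

```java
import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.pojo.ArrowType;

public final class HudiToArrowTypes {
  // Hypothetical Hudi-side enum covering the non-parameterized primitives
  // from the tables above; Hudi has no such enum today.
  public enum LogicalType { BOOLEAN, INTEGER, BIGINT, FLOAT16, FLOAT, DOUBLE, STRING, BINARY }

  // One of many per-integration mappings Option 1 would require
  // (Arrow here; Parquet, Avro, Spark, Flink, ... would each need their own).
  public static ArrowType toArrow(LogicalType type) {
    switch (type) {
      case BOOLEAN: return new ArrowType.Bool();
      case INTEGER: return new ArrowType.Int(32, /* signed */ true);
      case BIGINT:  return new ArrowType.Int(64, /* signed */ true);
      case FLOAT16: return new ArrowType.FloatingPoint(FloatingPointPrecision.HALF);
      case FLOAT:   return new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE);
      case DOUBLE:  return new ArrowType.FloatingPoint(FloatingPointPrecision.DOUBLE);
      case STRING:  return new ArrowType.Utf8();
      case BINARY:  return new ArrowType.Binary();
      default:      throw new IllegalArgumentException("No Arrow mapping for " + type);
    }
  }
}
```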

#### Option 2
Similar to how we use Avro today within the Hudi project for schema representation, we could instead leverage Apache Arrow and treat Arrow types as first-class citizens within the project. For example, Lance currently does not have a type system of its own and instead leverages Arrow directly:

```
Corollary 1: There is No Type System
(https://lancedb.com/blog/lance-v2/#corollary-1-there-is-no-type-system)

The Lance format itself does not have a type system. From Lance’s perspective,
every column is simply a collection of pages (with an encoding) and each page a
collection of buffers. Of course, the Lance readers and writers will actually
need to convert between these pages into typed arrays of some kind. The readers
and writers we have written use the Arrow type system
(https://arrow.apache.org/docs/format/Columnar.html).
```
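
To make Option 2 concrete, here is a minimal sketch using Arrow's Java API directly as the schema representation; the column names and the 768-dimension embedding are assumptions for illustration, but the Arrow classes and calls are real:

```java
import java.util.List;

import org.apache.arrow.vector.types.FloatingPointPrecision;
import org.apache.arrow.vector.types.TimeUnit;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.FieldType;
import org.apache.arrow.vector.types.pojo.Schema;

public class ArrowBackedSchemaSketch {
  public static void main(String[] args) {
    // STRING -> Utf8, LARGE_BINARY -> LargeBinary, TIMESTAMPTZ(us) -> Timestamp(us, "UTC")
    Field mediaId = new Field("media_id", FieldType.notNullable(new ArrowType.Utf8()), null);
    Field mediaBytes = new Field("media_bytes", FieldType.nullable(new ArrowType.LargeBinary()), null);
    Field ingestedAt = new Field("ingested_at",
        FieldType.nullable(new ArrowType.Timestamp(TimeUnit.MICROSECOND, "UTC")), null);

    // VECTOR(FLOAT, 768) maps naturally to Arrow's FixedSizeList of float32 with
    // a fixed list size of 768; the single "item" child carries the element type.
    Field embeddingElement = new Field("item",
        FieldType.notNullable(new ArrowType.FloatingPoint(FloatingPointPrecision.SINGLE)), null);
    Field embedding = new Field("embedding",
        FieldType.nullable(new ArrowType.FixedSizeList(768)), List.of(embeddingElement));

    Schema schema = new Schema(List.of(mediaId, mediaBytes, ingestedAt, embedding));
    System.out.println(schema);
  }
}
```

One trade-off this illustrates: with Option 2 there is no conversion layer to maintain for Arrow-native engines, but logical types Arrow does not model directly (e.g. a VECTOR as distinct from any other FixedSizeList) would have to be expressed via conventions such as field metadata.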



GitHub link: https://github.com/apache/hudi/discussions/14253
