mderoy opened a new issue, #10461:
URL: https://github.com/apache/iceberg/issues/10461
### Proposed Change
Traditional SQL engines support four major string types
```
char[N]
varchar[N]
nchar[N]
nvarchar[N]
```
while the Iceberg format supports only one:
```
string (equivalent to nvarchar[unlimited])
```
This causes friction when migrating data from a traditional SQL engine into
Iceberg tables.
### Proposal
We should add two new types to improve integration with traditional SQL
engines:
```
string(L) <- L is the max number of UTF-8 characters
string_padded(L) <- same as above, but padded with spaces to 'L' total
characters
```
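To make the intended semantics concrete, here is a minimal Python sketch of how a writer might validate the two proposed types. The function names and the reject-on-overflow behavior are assumptions for illustration, not part of the Iceberg spec or API:

```python
# Hypothetical illustration of the proposed string(L) / string_padded(L)
# semantics. Names and error behavior are assumptions, not Iceberg APIs.

def write_string(value: str, max_len: int) -> str:
    # string(L): value may be any length up to L characters
    if len(value) > max_len:
        raise ValueError(f"value exceeds declared max length {max_len}")
    return value

def write_string_padded(value: str, length: int) -> str:
    # string_padded(L): value is space-padded to exactly L characters
    return write_string(value, length).ljust(length)

print(repr(write_string("abc", 5)))         # 'abc'
print(repr(write_string_padded("abc", 5)))  # 'abc  '
```

The key design point is that `string_padded(L)` stores a fixed logical width, so trailing-space-sensitive operations (such as LIKE, shown below under Background) behave the way char(N) columns do in traditional engines.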
### Background
#### SQL incompatibility
First, some SQL statements' behavior depends on the datatype being padded up
to some length. For example, the LIKE operator, which compares two strings
including the padding, behaves differently on a char(20) column than on a
varchar(20) column:
```
SYSTEM.ADMIN(ADMIN)=> create table strexample(v1 char(20));
SYSTEM.ADMIN(ADMIN)=> insert into strexample values(' f ');
SYSTEM.ADMIN(ADMIN)=> insert into strexample values(' f');
SYSTEM.ADMIN(ADMIN)=> select * from strexample a, strexample b where a.v1
like b.v1;
V1 | V1
----------------------+----------------------
f | f
f | f
f | f
f | f
(4 rows)
SYSTEM.ADMIN(ADMIN)=> create table strexamplevar(v1 nvarchar(20));
SYSTEM.ADMIN(ADMIN)=> insert into strexamplevar values(' f ');
SYSTEM.ADMIN(ADMIN)=> insert into strexamplevar values(' f');
SYSTEM.ADMIN(ADMIN)=> select * from strexamplevar a, strexamplevar b where
a.v1 like b.v1;
V1 | V1
-----+-----
f | f
f | f
(2 rows)
```
As such, users cannot move their data into open formats and expect their
queries to return the same results, even if they continue to use the same
engine to query them, because Iceberg does not support a fixed, padded string
type.
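The row-count difference above can be reproduced outside SQL. This is a hedged Python sketch (LIKE patterns without wildcards reduce to equality, which is the assumption made here): char(20) pads both operands with spaces before comparing, so the two inserted values become identical, while varchar keeps trailing spaces significant.

```python
# Emulating CHAR(20) vs VARCHAR(20) comparison semantics.
# Assumption: LIKE with no wildcards behaves as string equality.

def char_pad(s: str, n: int = 20) -> str:
    # CHAR(N) semantics: space-pad the value to N characters
    return s.ljust(n)

values = [' f ', ' f']

# CHAR(20): both sides are padded to 20 chars, so ' f ' and ' f'
# compare equal and every cross pair matches.
char_matches = [(a, b) for a in values for b in values
                if char_pad(a) == char_pad(b)]

# VARCHAR(20): trailing spaces are significant, so only each value
# matches itself.
varchar_matches = [(a, b) for a in values for b in values if a == b]

print(len(char_matches))     # 4, matching the char(20) result above
print(len(varchar_matches))  # 2, matching the varchar result above
```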
#### Better integration with long-lived SQL engines that integrate with
Iceberg
Long-lived SQL engines like DB2 and Netezza (and others) were written with
the assumption that strings have some maximum length. To support Iceberg
tables, they must either silently truncate data or fail queries when strings
exceed that length. These engines may also use string size to optimize their
query planning. Supporting strings capped at a declared length would allow
users to offload data into Iceberg and make use of other tooling and engines
in the lakehouse (like Spark, Presto, etc.) without compromising the memory
requirements of these engines.
**For more info, see the proposal document.**
### Proposal document
https://docs.google.com/document/d/1chIx22dNZcMSsS607F96ARu4m4Yyf1dbsYxQTrfQ-5o/edit?usp=sharing
### Specifications
- [X] Table
- [X] View
- [ ] REST
- [ ] Puffin
- [ ] Encryption
- [ ] Other
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]