mderoy opened a new issue, #10461:
URL: https://github.com/apache/iceberg/issues/10461
### Proposed Change
Traditional SQL engines support four major string types
```
char[N]
varchar[N]
nchar[N]
nvarchar[N]
```
while the Iceberg format supports only one:
```
string (equivalent to nvarchar[unlimited])
```
This causes friction when migrating data from a traditional SQL engine into
Iceberg tables.
### Proposal
We should add two new types to improve integration with traditional SQL
engines:
```
string(L) <- L is the max number of UTF-8 characters
string_padded(L) <- same as above, but padded with spaces to 'L' total
characters
```
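To make the intended semantics concrete, here is a minimal Python sketch of how a writer might validate the two proposed types. The function names and the reject-on-overflow behavior are assumptions for illustration, not part of the Iceberg spec or API:

```python
# Hypothetical illustration of the proposed string(L) / string_padded(L)
# semantics. Names and error behavior are assumptions, not Iceberg APIs.

def write_string(value: str, max_len: int) -> str:
    # string(L): value may be any length up to L characters
    if len(value) > max_len:
        raise ValueError(f"value exceeds declared max length {max_len}")
    return value

def write_string_padded(value: str, length: int) -> str:
    # string_padded(L): value is space-padded to exactly L characters
    return write_string(value, length).ljust(length)

print(repr(write_string("abc", 5)))         # 'abc'
print(repr(write_string_padded("abc", 5)))  # 'abc  '
```

The key design point is that `string_padded(L)` stores a fixed logical width, so trailing-space-sensitive operations (such as LIKE, shown below under Background) behave the way char(N) columns do in traditional engines.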
### Background
#### SQL incompatibility
First, some SQL statements' behavior depends on the datatype being padded up
to some length. For example, the LIKE operator, which compares two strings
including the padding, behaves differently on a char(20) column than on a
varchar(20) column:
```
SYSTEM.ADMIN(ADMIN)=> create table strexample(v1 char(20));
SYSTEM.ADMIN(ADMIN)=> insert into strexample values(' f ');
SYSTEM.ADMIN(ADMIN)=> insert into strexample values(' f');
SYSTEM.ADMIN(ADMIN)=> select * from strexample a, strexample b where a.v1
like b.v1;
V1 | V1
----------------------+----------------------
f | f
f | f
f | f
f | f
(4 rows)
SYSTEM.ADMIN(ADMIN)=> create table strexamplevar(v1 nvarchar(20));
SYSTEM.ADMIN(ADMIN)=> insert into strexamplevar values(' f ');
SYSTEM.ADMIN(ADMIN)=> insert into strexamplevar values(' f');
SYSTEM.ADMIN(ADMIN)=> select * from strexamplevar a, strexamplevar b where
a.v1 like b.v1;
V1 | V1
-----+-----
f | f
f | f
(2 rows)
```
As such, users cannot move their data into open formats and expect their
queries to return the same results, even if they continue to use the same
engine to query them, because Iceberg does not support a fixed, padded string
type.
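The row-count difference above can be reproduced outside SQL. This is a hedged Python sketch (LIKE patterns without wildcards reduce to equality, which is the assumption made here): char(20) pads both operands with spaces before comparing, so the two inserted values become identical, while varchar keeps trailing spaces significant.

```python
# Emulating CHAR(20) vs VARCHAR(20) comparison semantics.
# Assumption: LIKE with no wildcards behaves as string equality.

def char_pad(s: str, n: int = 20) -> str:
    # CHAR(N) semantics: space-pad the value to N characters
    return s.ljust(n)

values = [' f ', ' f']

# CHAR(20): both sides are padded to 20 chars, so ' f ' and ' f'
# compare equal and every cross pair matches.
char_matches = [(a, b) for a in values for b in values
                if char_pad(a) == char_pad(b)]

# VARCHAR(20): trailing spaces are significant, so only each value
# matches itself.
varchar_matches = [(a, b) for a in values for b in values if a == b]

print(len(char_matches))     # 4, matching the char(20) result above
print(len(varchar_matches))  # 2, matching the varchar result above
```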
#### Better integration with long-lived SQL engines that integrate with
Iceberg
Long-lived SQL engines like DB2 and Netezza (and others) were written with
the assumption that strings have some maximum length. To support Iceberg
tables, they must either silently truncate data or fail queries when strings
exceed that length. These engines may also use string size to optimize their
query planning. Supporting strings capped at a declared length would allow
users to offload data into Iceberg and make use of other tooling and engines
in the lakehouse (like Spark, Presto, etc.) without compromising the memory
requirements of these engines.
**For more info, see the proposal document.**
### Proposal document
https://docs.google.com/document/d/1chIx22dNZcMSsS607F96ARu4m4Yyf1dbsYxQTrfQ-5o/edit?usp=sharing
### Specifications
- [X] Table
- [X] View
- [ ] REST
- [ ] Puffin
- [ ] Encryption
- [ ] Other
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]