Hi,

I prefer to keep the current behavior of carbondata long_string_columns 
support as it is, and to support the new Spark data types.
However, the discussion mail does not give a clear picture of how this will 
be handled in carbon and what role spark will play in handling and 
validating these types. If you can prepare a document with examples and a 
proper design, including code-level info, that would be great.

Thanks,

Regards,
Akash R


On 2021/10/10 19:47:31, Mahesh Raju Somalaraju <maheshraju.o...@gmail.com> 
wrote: 
> Dear Community,
> 
> This mail is regarding the char/varchar implementation in carbondata.
> Spark 3.1 recently added a char/varchar implementation[*#1*].
> 
> *command reference:*
> 1. create table charVarchar (id int, country varchar(10), name char(5),
> addr string) stored as carbondata;
> 2. insert into charVarchar select 1, 'india', 'mahesh', 'bangalore';
> 
>      VarcharType(length): A variant of `StringType` which has a length
> limitation. Data writing will fail if the input string exceeds the length
> limitation. Note: this type can only be used in table schema, not
> functions/operators.
> 
>       CharType(length): A variant of `VarcharType(length)` which is fixed
> length. Reading column of type `CharType(n)` always returns string values
> of length `n`. Char type column comparison will pad the short one to the
> longer length.
> 
> *Current behaviour[CarbonData]:*
> carbondata's existing varchar implementation is different from spark's.
> Carbondata treats a string column as a varchar column if the column is
> listed under long_string_columns, which can be configured in the table
> properties as shown below.
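> For reference, a minimal sketch of how long_string_columns is configured
> today (table and column names are just examples):
> 
>   create table longStrTable (id int, notes string)
>   stored as carbondata
>   tblproperties ('LONG_STRING_COLUMNS'='notes');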
> 
> If we execute the above commands with carbondata, it converts the
> char/varchar column data types to string columns and loads the data
> without any length checks (char(5) will accept more than 5 characters,
> varchar(10) will accept more than 10 characters).
>    - A normal string column accepts values up to the max size of short,
> which is 32k.
>    - A column configured with long_string_columns accepts values up to the
> max size of Integer, which is more than 32k.
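> For illustration, using the charVarchar table created above, the following
> insert currently succeeds in carbondata even though both values exceed the
> declared lengths (the values themselves are made-up examples):
> 
>   insert into charVarchar
>   select 2, 'united kingdom', 'maheshraju', 'bangalore';
>   -- 'united kingdom' (14 chars) > varchar(10); 'maheshraju' (10 chars) > char(5)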
> 
> *Spark & parquet Behaviour:*
> 1) If we run the above commands with parquet, spark stores the columns as
> char/varchar data types and validates the string lengths against those
> given in the create table command. On a length mismatch, the load/insert
> command fails with an exception.
> 2) If we declare char(n) with a large n and load a shorter value, spark
> pads with trailing spaces in the following cases (see the sketch after
> this list):
>   i) when reading char type columns (spark does not pad at the writing
> side, to save storage space);
>   ii) when comparing a char type column with a string literal or with
> another char type column.
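> A small sketch of this spark-on-parquet behaviour (illustrative only, not
> taken from the PR):
> 
>   create table charParquet (name char(5)) stored as parquet;
>   insert into charParquet values ('abc');       -- stored unpadded as 'abc'
>   select length(name) from charParquet;         -- returns 5: read back as 'abc  '
>   insert into charParquet values ('abcdefgh');  -- fails: exceeds char(5)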
> 
> For more details about the spark implementation, we can refer to PR[*#1*].
> 
> *Proposed Solution:*
> 1) Keep the existing carbondata varchar implementation [string columns
> with long_string_columns] as it is, since removing it may cause
> compatibility issues.
> 2) Support the new column data types char(n) and varchar(n), and show them
> in metadata as the actual charType(n) and varCharType(n) instead of as
> string columns.
> 3) Handle the length check for char/varchar in both the partition and
> non-partition cases. On a length mismatch, throw an exception (see the
> sketch after this list).
> 4) Phase-1: develop for primitive columns; phase-2: check for complex
> columns.
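> To make the proposal concrete, a sketch of the intended behaviour (the
> exact error handling is still open):
> 
>   create table charVarchar (id int, country varchar(10), name char(5),
>   addr string) stored as carbondata;
>   desc charVarchar;   -- would show varchar(10) and char(5), not string
>   insert into charVarchar select 1, 'united kingdom', 'abc', 'bangalore';
>   -- would fail: 'united kingdom' exceeds varchar(10)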
> 
> *Benefits*:
> char and varchar are standard SQL types, and varchar is widely used in
> other databases in place of the string type.
> 
> *#1 *https://github.com/apache/spark/pull/30412
> 
> *Please provide your valuable inputs and suggestions. Thank you in advance
> !*
> 
> Thanks & Regards
> -Mahesh Raju Somalaraju
> github id: maheshrajus
> 
