Thank you Mich Talebzadeh for your answer. It's good to know that VARCHAR and CHAR are showing properly in Spark 3. Do you know what changed in Spark 3 that made this possible? And how can I achieve the same output in Spark 2.4.1? If there are some conf options that would enable this, that would be helpful.
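For what it's worth, the only workaround I can see for 2.4.1 is to bypass
Spark's type mapping and read the column types straight out of the Hive
metastore. A rough sketch of what I mean, in Scala against the Hive
metastore API (assuming the client can reach the same metastore that
Spark uses; I would still prefer a conf option if one exists):

    import org.apache.hadoop.hive.conf.HiveConf
    import org.apache.hadoop.hive.metastore.HiveMetaStoreClient
    import scala.collection.JavaConverters._

    // Talk to the metastore directly, so Spark's schema conversion
    // (which widens CHAR/VARCHAR to STRING in 2.4.x) is never involved.
    val client = new HiveMetaStoreClient(new HiveConf())
    // getType returns the raw Hive type, e.g. "varchar(30)", length
    // included; partition columns, if any, come from getPartitionKeys.
    val cols = client.getTable("test", "etc").getSd.getCols.asScala
    cols.foreach(c => println(s"${c.getName} ${c.getType}"))
    client.close()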
Thanks,
Venkat
2016173438

On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Hive 3.1.1
> Spark 3.1.1
>
> Your stack overflow issue raised, and I quote:
>
> "I have a need to generate DDL statements for Hive tables & views
> programmatically. I tried using Spark and Beeline for this task. Beeline
> takes around 5-10 seconds for each of the statements, whereas Spark
> completes the same thing in a few milliseconds. I am planning to use
> Spark since it is faster compared to Beeline. One downside of using
> Spark to get DDL statements from Hive is that it treats CHAR and VARCHAR
> columns as STRING and does not preserve the length information that goes
> with the CHAR and VARCHAR data types, whereas Beeline preserves both the
> data type and the length information. I am using Spark 2.4.1 and Beeline
> 2.1.1.
>
> Given below are the sample create table command and its show create
> table output."
>
> Create a simple table in Hive in the test database:
>
> hive> use test;
> OK
> hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
> OK
> hive> desc formatted etc;
> # col_name              data_type       comment
> id                      bigint
> col1                    varchar(30)
> col2                    string
>
> # Detailed Table Information
> Database:               test
> OwnerType:              USER
> Owner:                  hduser
> CreateTime:             Fri Mar 11 18:29:34 GMT 2022
> LastAccessTime:         UNKNOWN
> Retention:              0
> Location:               hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
> Table Type:             MANAGED_TABLE
> Table Parameters:
>         COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>         bucketing_version       2
>         numFiles                0
>         numRows                 0
>         rawDataSize             0
>         totalSize               0
>         transient_lastDdlTime   1647023374
>
> # Storage Information
> SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
> InputFormat:            org.apache.hadoop.mapred.TextInputFormat
> OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> Compressed:             No
> Num Buckets:            -1
> Bucket Columns:         []
> Sort Columns:           []
> Storage Desc Params:
>         serialization.format    1
>
> Now let's go to spark-shell:
>
> scala> spark.sql("show create table test.etc").show(false)
> +-----------------------------------------+
> |createtab_stmt                           |
> +-----------------------------------------+
> |CREATE TABLE `test`.`etc` (
>   `id` BIGINT,
>   `col1` VARCHAR(30),
>   `col2` STRING)
> USING text
> TBLPROPERTIES (
>   'bucketing_version' = '2',
>   'transient_lastDdlTime' = '1647023374')
> |
> +-----------------------------------------+
>
> You can see that Spark shows the columns correctly.
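> As an aside, since the aim is to generate the DDL programmatically
> rather than read it off the console, the statement can also be
> collected as a plain string. A minimal sketch against the table above
> (SHOW CREATE TABLE returns a single row with one column,
> createtab_stmt, so first().getString(0) yields the full DDL text):
>
> scala> val ddl = spark.sql("SHOW CREATE TABLE test.etc").first().getString(0)
> scala> println(ddl)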
> Now let us go and create the same table in Hive through beeline:
>
> 0: jdbc:hive2://rhes75:10099/default> use test;
> No rows affected (0.019 seconds)
> 0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
> No rows affected (0.304 seconds)
> 0: jdbc:hive2://rhes75:10099/default> desc formatted etc;
> +-------------------------------+----------------------------------------------------+------------+
> | col_name                      | data_type                                          | comment    |
> +-------------------------------+----------------------------------------------------+------------+
> | # col_name                    | data_type                                          | comment    |
> | id                            | bigint                                             |            |
> | col1                          | varchar(30)                                        |            |
> | col2                          | string                                             |            |
> |                               | NULL                                               | NULL       |
> | # Detailed Table Information  | NULL                                               | NULL       |
> | Database:                     | test                                               | NULL       |
> | OwnerType:                    | USER                                               | NULL       |
> | Owner:                        | hduser                                             | NULL       |
> | CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                       | NULL       |
> | LastAccessTime:               | UNKNOWN                                            | NULL       |
> | Retention:                    | 0                                                  | NULL       |
> | Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL       |
> | Table Type:                   | MANAGED_TABLE                                      | NULL       |
> | Table Parameters:             | NULL                                               | NULL       |
> |                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
> |                               | bucketing_version                                  | 2          |
> |                               | numFiles                                           | 0          |
> |                               | numRows                                            | 0          |
> |                               | rawDataSize                                        | 0          |
> |                               | totalSize                                          | 0          |
> |                               | transient_lastDdlTime                              | 1647024660 |
> |                               | NULL                                               | NULL       |
> | # Storage Information         | NULL                                               | NULL       |
> | SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL       |
> | InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL       |
> | OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
> | Compressed:                   | No                                                 | NULL       |
> | Num Buckets:                  | -1                                                 | NULL       |
> | Bucket Columns:               | []                                                 | NULL       |
> | Sort Columns:                 | []                                                 | NULL       |
> | Storage Desc Params:          | NULL                                               | NULL       |
> |                               | serialization.format                               | 1          |
> +-------------------------------+----------------------------------------------------+------------+
> 33 rows selected (0.159 seconds)
>
> Now check that in spark-shell again:
>
> scala> spark.sql("show create table test.etc").show(false)
> +-----------------------------------------+
> |createtab_stmt                           |
> +-----------------------------------------+
> |CREATE TABLE `test`.`etc` (
>   `id` BIGINT,
>   `col1` VARCHAR(30),
>   `col2` STRING)
> USING text
> TBLPROPERTIES (
>   'bucketing_version' = '2',
>   'transient_lastDdlTime' = '1647024660')
> |
> +-----------------------------------------+
>
> It shows OK. So, in summary, you get the column definitions in Spark
> exactly as you have defined them in Hive.
>
> Your statement above, and I quote, "I am using Spark 2.4.1 and Beeline
> 2.1.1", refers to older versions of Spark and Hive which may have had
> such issues.
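> And if you need the DDL for every table in a database, the same call
> can be driven from the catalog. A sketch along these lines, untested
> (temporary views are filtered out since they belong to no database):
>
> // list permanent tables in the database, then collect each DDL string
> val ddls = spark.catalog.listTables("test").collect()
>   .filter(!_.isTemporary)
>   .map(t => spark.sql(s"SHOW CREATE TABLE ${t.database}.${t.name}").first().getString(0))
> ddls.foreach(println)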
> HTH
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may
> arise from relying on this email's technical content is explicitly
> disclaimed. The author will in no case be liable for any monetary
> damages arising from such loss, damage or destruction.
>
> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan <m.venkatbe...@gmail.com> wrote:
>
>> Hi Spark Team,
>>
>> I have raised a question about Spark on Stack Overflow. When you get a
>> chance, can you please take a look and help me?
>>
>> https://stackoverflow.com/q/71431757/5927843
>>
>> Thanks,
>> Venkat
>> 2016173438