Thank you. I am trying to get the table definition for existing tables. By the way, the CREATE and SHOW commands that you executed, were they on Spark 3.x?
Thanks,
Venkat
2016173438

On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Well, I do not know what has changed. However, this should not affect
> your work.
>
> Try to create the table in Spark:
>
>     sqltext: String =
>     CREATE TABLE IF NOT EXISTS test.etcs(
>           ID INT
>         , CLUSTERED INT
>         , SCATTERED INT
>         , RANDOMISED INT
>         , RANDOM_STRING VARCHAR(50)
>         , SMALL_VC VARCHAR(10)
>         , PADDING VARCHAR(4000)
>         , PADDING2 STRING
>     )
>     CLUSTERED BY (ID) INTO 256 BUCKETS
>     STORED AS PARQUET
>     TBLPROPERTIES (
>       "parquet.compress"="SNAPPY"
>     )
>
>     scala> spark.sql(sqltext)
>     scala> spark.sql("show create table test.etcs").show(false)
>
>     createtab_stmt
>     --------------------------------------------------------
>     CREATE TABLE `test`.`etcs` (
>       `ID` INT,
>       `CLUSTERED` INT,
>       `SCATTERED` INT,
>       `RANDOMISED` INT,
>       `RANDOM_STRING` VARCHAR(50),
>       `SMALL_VC` VARCHAR(10),
>       `PADDING` VARCHAR(4000),
>       `PADDING2` STRING)
>     USING parquet
>     CLUSTERED BY (ID)
>     INTO 256 BUCKETS
>     TBLPROPERTIES (
>       'transient_lastDdlTime' = '1647033659',
>       'parquet.compress' = 'SNAPPY')
>
> Note that the columns are OK.
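[Editor's note: the `createtab_stmt` output above can be checked mechanically. The sketch below is an illustration added to this thread, not code from it; the function names and the regex are my own, and it assumes the backtick-quoted column style that Spark emits in SHOW CREATE TABLE.]

```python
import re

def parse_columns(createtab_stmt):
    """Map column name -> declared type from a SHOW CREATE TABLE statement."""
    # Take everything between the first '(' and the ')' that precedes
    # USING/STORED -- i.e. the column list.
    m = re.search(r"\((.*)\)\s*(?:USING|STORED)", createtab_stmt,
                  re.DOTALL | re.IGNORECASE)
    if m is None:
        raise ValueError("no column list found in statement")
    # Column entries look like:  `RANDOM_STRING` VARCHAR(50),
    pairs = re.findall(r"`(\w+)`\s+([A-Za-z]+(?:\(\d+(?:,\s*\d+)?\))?)",
                       m.group(1))
    return {name: dtype.upper() for name, dtype in pairs}

def suspect_columns(cols):
    """Columns typed plain STRING -- on Spark 2.4 these may be CHAR/VARCHAR
    columns whose length information was dropped, per the issue in this
    thread."""
    return [name for name, dtype in cols.items() if dtype == "STRING"]
```

Run against the `test.etcs` statement shown above, `parse_columns` reports `VARCHAR(50)` for `RANDOM_STRING`, confirming the length survived; columns flagged by `suspect_columns` may be genuine STRINGs or degraded CHAR/VARCHARs, which is exactly the ambiguity Spark 2.4 introduces.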
> Also check this link for the differences between the CHAR, VARCHAR and
> STRING types in Hive:
>
> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
>
> HTH
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan
> <m.venkatbe...@gmail.com> wrote:
>
>> Thank you Mich Talebzadeh for your answer. It is good to know that
>> VARCHAR and CHAR are shown correctly in Spark 3. Do you know what changed
>> in Spark 3 that made this possible? Or how can I achieve the same output
>> in Spark 2.4.1? If there are some conf options, that would be helpful.
>>
>> Thanks,
>> Venkat
>> 2016173438
>>
>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>>> Hive 3.1.1
>>> Spark 3.1.1
>>>
>>> Your Stack Overflow issue raised, and I quote:
>>>
>>> "I have a need to generate DDL statements for Hive tables & views
>>> programmatically. I tried using Spark and Beeline for this task. Beeline
>>> takes around 5-10 seconds for each of the statements whereas Spark
>>> completes the same thing in a few milliseconds. I am planning to use
>>> Spark since it is faster compared to beeline. One downside of using
>>> spark for getting DDL statements from the hive is, it treats CHAR,
>>> VARCHAR characters as String and it doesn't preserve the length
>>> information that goes with CHAR, VARCHAR data types. At the same time
>>> beeline preserves the data type and the length information for CHAR,
>>> VARCHAR data types.
>>> I am using Spark 2.4.1 and Beeline 2.1.1.
>>>
>>> Given below the sample create table command and its show create table
>>> output."
>>>
>>> Create a simple table in Hive in the test database:
>>>
>>>     hive> use test;
>>>     OK
>>>     hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
>>>     OK
>>>     hive> desc formatted etc;
>>>     # col_name            data_type      comment
>>>     id                    bigint
>>>     col1                  varchar(30)
>>>     col2                  string
>>>
>>>     # Detailed Table Information
>>>     Database:             test
>>>     OwnerType:            USER
>>>     Owner:                hduser
>>>     CreateTime:           Fri Mar 11 18:29:34 GMT 2022
>>>     LastAccessTime:       UNKNOWN
>>>     Retention:            0
>>>     Location:             hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>     Table Type:           MANAGED_TABLE
>>>     Table Parameters:
>>>         COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>         bucketing_version      2
>>>         numFiles               0
>>>         numRows                0
>>>         rawDataSize            0
>>>         totalSize              0
>>>         transient_lastDdlTime  1647023374
>>>
>>>     # Storage Information
>>>     SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>     InputFormat:          org.apache.hadoop.mapred.TextInputFormat
>>>     OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>     Compressed:           No
>>>     Num Buckets:          -1
>>>     Bucket Columns:       []
>>>     Sort Columns:         []
>>>     Storage Desc Params:
>>>         serialization.format  1
>>>
>>> Now let's go to spark-shell:
>>>
>>>     scala> spark.sql("show create table test.etc").show(false)
>>>     createtab_stmt
>>>     --------------------------------------------------------
>>>     CREATE TABLE `test`.`etc` (
>>>       `id` BIGINT,
>>>       `col1` VARCHAR(30),
>>>       `col2` STRING)
>>>     USING text
>>>     TBLPROPERTIES (
>>>       'bucketing_version' = '2',
>>>       'transient_lastDdlTime' = '1647023374')
>>>
>>> You can see that Spark shows the columns correctly.
>>>
>>> Now let us go and create the same table in Hive through Beeline:
>>>
>>>     0: jdbc:hive2://rhes75:10099/default> use test
>>>     No rows affected (0.019 seconds)
>>>     0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
>>>     No rows affected (0.304 seconds)
>>>     0: jdbc:hive2://rhes75:10099/default> desc formatted etc
>>>     | col_name                      | data_type                                                  | comment    |
>>>     | # col_name                    | data_type                                                  | comment    |
>>>     | id                            | bigint                                                     |            |
>>>     | col1                          | varchar(30)                                                |            |
>>>     | col2                          | string                                                     |            |
>>>     |                               | NULL                                                       | NULL       |
>>>     | # Detailed Table Information  | NULL                                                       | NULL       |
>>>     | Database:                     | test                                                       | NULL       |
>>>     | OwnerType:                    | USER                                                       | NULL       |
>>>     | Owner:                        | hduser                                                     | NULL       |
>>>     | CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                               | NULL       |
>>>     | LastAccessTime:               | UNKNOWN                                                    | NULL       |
>>>     | Retention:                    | 0                                                          | NULL       |
>>>     | Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc         | NULL       |
>>>     | Table Type:                   | MANAGED_TABLE                                              | NULL       |
>>>     | Table Parameters:             | NULL                                                       | NULL       |
>>>     |                               | COLUMN_STATS_ACCURATE                                      | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
>>>     |                               | bucketing_version                                          | 2          |
>>>     |                               | numFiles                                                   | 0          |
>>>     |                               | numRows                                                    | 0          |
>>>     |                               | rawDataSize                                                | 0          |
>>>     |                               | totalSize                                                  | 0          |
>>>     |                               | transient_lastDdlTime                                      | 1647024660 |
>>>     |                               | NULL                                                       | NULL       |
>>>     | # Storage Information         | NULL                                                       | NULL       |
>>>     | SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe         | NULL       |
>>>     | InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat                   | NULL       |
>>>     | OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL       |
>>>     | Compressed:                   | No                                                         | NULL       |
>>>     | Num Buckets:                  | -1                                                         | NULL       |
>>>     | Bucket Columns:               | []                                                         | NULL       |
>>>     | Sort Columns:                 | []                                                         | NULL       |
>>>     | Storage Desc Params:          | NULL                                                       | NULL       |
>>>     |                               | serialization.format                                       | 1          |
>>>     33 rows selected (0.159 seconds)
>>>
>>> Now check that in spark-shell again:
>>>
>>>     scala> spark.sql("show create table test.etc").show(false)
>>>     createtab_stmt
>>>     --------------------------------------------------------
>>>     CREATE TABLE `test`.`etc` (
>>>       `id` BIGINT,
>>>       `col1` VARCHAR(30),
>>>       `col2` STRING)
>>>     USING text
>>>     TBLPROPERTIES (
>>>       'bucketing_version' = '2',
>>>       'transient_lastDdlTime' = '1647024660')
>>>
>>> It shows OK. So, in summary, you get the column definitions in Spark as
>>> you have defined them in Hive.
>>>
>>> Your statement above, and I quote, "I am using Spark 2.4.1 and Beeline
>>> 2.1.1", refers to older versions of Spark and Hive which may have had
>>> such issues.
>>>
>>> HTH
>>>
>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan
>>> <m.venkatbe...@gmail.com> wrote:
>>>
>>>> Hi Spark Team,
>>>>
>>>> I have raised a question on Spark through Stack Overflow. When you get
>>>> a chance, can you please take a look and help me?
>>>>
>>>> https://stackoverflow.com/q/71431757/5927843
>>>>
>>>> Thanks,
>>>> Venkat
>>>> 2016173438
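[Editor's note: the Stack Overflow question quoted in this thread is about generating DDL for many tables programmatically. A minimal sketch of that loop is below; it is my own illustration, not code from the thread. The `run_sql` parameter is a stand-in the caller supplies (for example, a wrapper around `spark.sql(...).collect()`), so the function itself does not require a live SparkSession.]

```python
def collect_ddl(run_sql, database):
    """Return {table_name: CREATE TABLE statement} for every table in
    `database`.

    `run_sql` is any callable that takes a SQL string and returns a list
    of dict-like rows.  With a live SparkSession it could be, e.g.:
        run_sql = lambda q: [r.asDict() for r in spark.sql(q).collect()]
    """
    ddls = {}
    for row in run_sql("SHOW TABLES IN " + database):
        name = row["tableName"]
        result = run_sql("SHOW CREATE TABLE {}.{}".format(database, name))
        # Spark returns the statement in a single `createtab_stmt` column.
        ddls[name] = result[0]["createtab_stmt"]
    return ddls
```

Running this through Spark rather than Beeline matches the approach in the question: one in-process catalog call per table instead of a multi-second JDBC round trip per statement.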