Hi, does anybody else have a better suggestion for my problem?

Thanks,
Venkat
2016173438
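One possible direction, sketched below as a hypothetical workaround rather than a tested solution: since Hive's `DESCRIBE` (as shown later in this thread) does report `varchar(30)` with its length, the `(col_name, data_type)` rows can be collected once per table (from Beeline or the metastore) and the column list reassembled by hand instead of relying on Spark 2.4's `SHOW CREATE TABLE`. The `build_create_table` helper and the sample rows are illustrative only.

```python
def build_create_table(db, table, columns):
    """Rebuild a CREATE TABLE column list from (col_name, data_type) pairs,
    e.g. the leading rows of a DESCRIBE statement. Unlike SHOW CREATE TABLE
    on Spark 2.4, such rows can keep VARCHAR(n)/CHAR(n) length info."""
    body = ",\n".join(f"  `{name}` {dtype.upper()}" for name, dtype in columns)
    return f"CREATE TABLE `{db}`.`{table}` (\n{body})"

# Rows shaped like the `desc formatted etc` output from this thread:
rows = [("id", "bigint"), ("col1", "varchar(30)"), ("col2", "string")]
print(build_create_table("test", "etc", rows))
```

This only covers the column list; partitioning, storage format and TBLPROPERTIES would still need to be pulled from `desc formatted` output or the metastore.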
On Fri, Mar 11, 2022 at 4:43 PM Venkatesan Muniappan <m.venkatbe...@gmail.com> wrote:

> OK. I work for an org where such upgrades take a few months. Not an
> immediate task.
>
> Thanks,
> Venkat
> 2016173438
>
> On Fri, Mar 11, 2022 at 4:38 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Yes, in Spark 3.1.1. Best to upgrade to Spark 3+.
>>
>> View my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>> On Fri, 11 Mar 2022 at 21:35, Venkatesan Muniappan <m.venkatbe...@gmail.com> wrote:
>>
>>> Thank you. I am trying to get the table definition for the existing
>>> tables. BTW, the CREATE and SHOW commands that you executed, were they
>>> on Spark 3.x?
>>>
>>> Thanks,
>>> Venkat
>>> 2016173438
>>>
>>> On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Well, I do not know what has changed. However, this should not affect
>>>> your work.
>>>> Try to create the table in Spark:
>>>>
>>>> sqltext: String =
>>>> CREATE TABLE IF NOT EXISTS test.etcs(
>>>>   ID INT
>>>> , CLUSTERED INT
>>>> , SCATTERED INT
>>>> , RANDOMISED INT
>>>> , RANDOM_STRING VARCHAR(50)
>>>> , SMALL_VC VARCHAR(10)
>>>> , PADDING VARCHAR(4000)
>>>> , PADDING2 STRING
>>>> )
>>>> CLUSTERED BY (ID) INTO 256 BUCKETS
>>>> STORED AS PARQUET
>>>> TBLPROPERTIES (
>>>>   "parquet.compress"="SNAPPY"
>>>> )
>>>>
>>>> scala> spark.sql(sqltext)
>>>> scala> spark.sql("show create table test.etcs").show(false)
>>>>
>>>> |createtab_stmt|
>>>> |CREATE TABLE `test`.`etcs` (
>>>>   `ID` INT,
>>>>   `CLUSTERED` INT,
>>>>   `SCATTERED` INT,
>>>>   `RANDOMISED` INT,
>>>>   `RANDOM_STRING` VARCHAR(50),
>>>>   `SMALL_VC` VARCHAR(10),
>>>>   `PADDING` VARCHAR(4000),
>>>>   `PADDING2` STRING)
>>>> USING parquet
>>>> CLUSTERED BY (ID)
>>>> INTO 256 BUCKETS
>>>> TBLPROPERTIES (
>>>>   'transient_lastDdlTime' = '1647033659',
>>>>   'parquet.compress' = 'SNAPPY')|
>>>> Note that the columns are OK.
>>>>
>>>> Also check this link for the differences between the CHAR, VARCHAR and
>>>> STRING types in Hive:
>>>>
>>>> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
>>>>
>>>> HTH
>>>>
>>>> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan <m.venkatbe...@gmail.com> wrote:
>>>>
>>>>> Thank you Mich Talebzadeh for your answer. It's good to know that
>>>>> VARCHAR and CHAR are shown properly in Spark 3. Do you know what
>>>>> changed in Spark 3 that made this possible? Or how can I achieve the
>>>>> same output in Spark 2.4.1? If there are some conf options, that would
>>>>> be helpful.
>>>>>
>>>>> Thanks,
>>>>> Venkat
>>>>> 2016173438
>>>>>
>>>>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hive 3.1.1
>>>>>> Spark 3.1.1
>>>>>>
>>>>>> Your Stack Overflow issue raised, and I quote:
>>>>>>
>>>>>> "I have a need to generate DDL statements for Hive tables & views
>>>>>> programmatically. I tried using Spark and Beeline for this task.
>>>>>> Beeline takes around 5-10 seconds for each of the statements whereas
>>>>>> Spark completes the same thing in a few milliseconds. I am planning to
>>>>>> use Spark since it is faster compared to Beeline.
>>>>>> One downside of using Spark for getting DDL statements from Hive is
>>>>>> that it treats CHAR and VARCHAR columns as String, and it doesn't
>>>>>> preserve the length information that goes with the CHAR and VARCHAR
>>>>>> data types. At the same time, Beeline preserves the data type and the
>>>>>> length information for CHAR and VARCHAR data types. *I am using Spark
>>>>>> 2.4.1 and Beeline 2.1.1.*
>>>>>>
>>>>>> Given below are the sample CREATE TABLE command and its SHOW CREATE
>>>>>> TABLE output."
>>>>>>
>>>>>> Create a simple table in *Hive* in the test database:
>>>>>>
>>>>>> hive> *use test;*
>>>>>> OK
>>>>>> hive> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);*
>>>>>> OK
>>>>>> hive> *desc formatted etc;*
>>>>>> # col_name        data_type       comment
>>>>>> id                bigint
>>>>>> col1              varchar(30)
>>>>>> col2              string
>>>>>>
>>>>>> # Detailed Table Information
>>>>>> Database:         test
>>>>>> OwnerType:        USER
>>>>>> Owner:            hduser
>>>>>> CreateTime:       Fri Mar 11 18:29:34 GMT 2022
>>>>>> LastAccessTime:   UNKNOWN
>>>>>> Retention:        0
>>>>>> Location:         hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>>>> Table Type:       MANAGED_TABLE
>>>>>> Table Parameters:
>>>>>>   COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>>>>   bucketing_version      2
>>>>>>   numFiles               0
>>>>>>   numRows                0
>>>>>>   rawDataSize            0
>>>>>>   totalSize              0
>>>>>>   transient_lastDdlTime  1647023374
>>>>>>
>>>>>> # Storage Information
>>>>>> SerDe Library:    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>>>> InputFormat:      org.apache.hadoop.mapred.TextInputFormat
>>>>>> OutputFormat:     org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>>>> Compressed:       No
>>>>>> Num Buckets:      -1
>>>>>> Bucket Columns:   []
>>>>>> Sort Columns:     []
>>>>>> Storage Desc Params:
>>>>>>   serialization.format  1
>>>>>>
>>>>>> Now let's go to spark-shell:
>>>>>>
>>>>>> scala> *spark.sql("show create table test.etc").show(false)*
>>>>>> |createtab_stmt|
>>>>>> |CREATE TABLE `test`.`etc` (
>>>>>>   `id` BIGINT,
>>>>>>   `col1` VARCHAR(30),
>>>>>>   `col2` STRING)
>>>>>> USING text
>>>>>> TBLPROPERTIES (
>>>>>>   'bucketing_version' = '2',
>>>>>>   'transient_lastDdlTime' = '1647023374')|
>>>>>>
>>>>>> You can see that Spark shows the columns correctly.
>>>>>>
>>>>>> Now let us go and create the same table in Hive through Beeline:
>>>>>>
>>>>>> 0: jdbc:hive2://rhes75:10099/default> *use test*
>>>>>> No rows affected (0.019 seconds)
>>>>>> 0: jdbc:hive2://rhes75:10099/default> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)*
>>>>>> No rows affected (0.304 seconds)
>>>>>> 0: jdbc:hive2://rhes75:10099/default> *desc formatted etc*
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>> | col_name                      | data_type                                          | comment    |
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>> | # col_name                    | data_type                                          | comment    |
>>>>>> | id                            | bigint                                             |            |
>>>>>> | col1                          | varchar(30)                                        |            |
>>>>>> | col2                          | string                                             |            |
>>>>>> |                               | NULL                                               | NULL       |
>>>>>> | # Detailed Table Information  | NULL                                               | NULL       |
>>>>>> | Database:                     | test                                               | NULL       |
>>>>>> | OwnerType:                    | USER                                               | NULL       |
>>>>>> | Owner:                        | hduser                                             | NULL       |
>>>>>> | CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                       | NULL       |
>>>>>> | LastAccessTime:               | UNKNOWN                                            | NULL       |
>>>>>> | Retention:                    | 0                                                  | NULL       |
>>>>>> | Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL       |
>>>>>> | Table Type:                   | MANAGED_TABLE                                      | NULL       |
>>>>>> | Table Parameters:             | NULL                                               | NULL       |
>>>>>> |                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
>>>>>> |                               | bucketing_version                                  | 2          |
>>>>>> |                               | numFiles                                           | 0          |
>>>>>> |                               | numRows                                            | 0          |
>>>>>> |                               | rawDataSize                                        | 0          |
>>>>>> |                               | totalSize                                          | 0          |
>>>>>> |                               | transient_lastDdlTime                              | 1647024660 |
>>>>>> |                               | NULL                                               | NULL       |
>>>>>> | # Storage Information         | NULL                                               | NULL       |
>>>>>> | SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL       |
>>>>>> | InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL       |
>>>>>> | OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
>>>>>> | Compressed:                   | No                                                 | NULL       |
>>>>>> | Num Buckets:                  | -1                                                 | NULL       |
>>>>>> | Bucket Columns:               | []                                                 | NULL       |
>>>>>> | Sort Columns:                 | []                                                 | NULL       |
>>>>>> | Storage Desc Params:          | NULL                                               | NULL       |
>>>>>> |                               | serialization.format                               | 1          |
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>> 33 rows selected (0.159 seconds)
>>>>>>
>>>>>> Now check that in spark-shell again:
>>>>>>
>>>>>> scala> spark.sql("show create table test.etc").show(false)
>>>>>>
>>>>>> |createtab_stmt|
>>>>>> |CREATE TABLE `test`.`etc` (
>>>>>>   `id` BIGINT,
>>>>>>   `col1` VARCHAR(30),
>>>>>>   `col2` STRING)
>>>>>> USING text
>>>>>> TBLPROPERTIES (
>>>>>>   'bucketing_version' = '2',
>>>>>>   'transient_lastDdlTime' = '1647024660')|
>>>>>>
>>>>>> It shows OK.
>>>>>> So in summary, you get the column definitions in Spark as you have
>>>>>> defined them in Hive.
>>>>>>
>>>>>> Your statement above, and I quote, "I am using Spark 2.4.1 and
>>>>>> Beeline 2.1.1", refers to older versions of Spark and Hive which may
>>>>>> have had such issues.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan <m.venkatbe...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Spark Team,
>>>>>>>
>>>>>>> I have raised a question on Spark through Stack Overflow. When you
>>>>>>> get a chance, can you please take a look and help me?
>>>>>>>
>>>>>>> https://stackoverflow.com/q/71431757/5927843
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Venkat
>>>>>>> 2016173438
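On the original Stack Overflow complaint that Beeline takes 5-10 seconds per statement: much of that cost is per-invocation startup, and Beeline can run a whole script file in one session via its `-f` option. A minimal sketch of batching every SHOW CREATE TABLE into a single script (the file name `ddl.hql` and the table list are illustrative only):

```python
def beeline_script(tables):
    """One SHOW CREATE TABLE per table, all in a single HiveQL script, so a
    single `beeline -u <jdbc-url> -f ddl.hql` run pays the startup cost
    once rather than once per statement."""
    return "".join(f"SHOW CREATE TABLE {t};\n" for t in tables)

# Example: the two tables discussed in this thread.
with open("ddl.hql", "w") as f:
    f.write(beeline_script(["test.etc", "test.etcs"]))
```

This keeps Beeline's faithful CHAR/VARCHAR output while amortising its startup overhead across all tables.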