Hi, does anybody else have a better suggestion for my problem?

Thanks,
Venkat
2016173438
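One possible direction, sketched below as a hypothetical workaround rather than a tested solution: since Hive's `DESCRIBE` (as shown later in this thread) does report `varchar(30)` with its length, the `(col_name, data_type)` rows can be collected once per table (from Beeline or the metastore) and the column list reassembled by hand instead of relying on Spark 2.4's `SHOW CREATE TABLE`. The `build_create_table` helper and the sample rows are illustrative only.

```python
def build_create_table(db, table, columns):
    """Rebuild a CREATE TABLE column list from (col_name, data_type) pairs,
    e.g. the leading rows of a DESCRIBE statement. Unlike SHOW CREATE TABLE
    on Spark 2.4, such rows can keep VARCHAR(n)/CHAR(n) length info."""
    body = ",\n".join(f"  `{name}` {dtype.upper()}" for name, dtype in columns)
    return f"CREATE TABLE `{db}`.`{table}` (\n{body})"

# Rows shaped like the `desc formatted etc` output from this thread:
rows = [("id", "bigint"), ("col1", "varchar(30)"), ("col2", "string")]
print(build_create_table("test", "etc", rows))
```

This only covers the column list; partitioning, storage format and TBLPROPERTIES would still need to be pulled from `desc formatted` output or the metastore.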
On Fri, Mar 11, 2022 at 4:43 PM Venkatesan Muniappan <m.venkatbe...@gmail.com> wrote:

> OK. I work for an org where such upgrades take a few months. Not an
> immediate task.
>
> Thanks,
> Venkat
> 2016173438
>
> On Fri, Mar 11, 2022 at 4:38 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Yes, in Spark 3.1.1. Best to upgrade to Spark 3+.
>>
>> View my LinkedIn profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>> On Fri, 11 Mar 2022 at 21:35, Venkatesan Muniappan <m.venkatbe...@gmail.com> wrote:
>>
>>> Thank you. I am trying to get the table definition for the existing
>>> tables. BTW, the CREATE and SHOW commands that you executed, were they
>>> on Spark 3.x?
>>>
>>> Thanks,
>>> Venkat
>>> 2016173438
>>>
>>> On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Well, I do not know what has changed. However, this should not affect
>>>> your work.
>>>> Try to create the table in Spark:
>>>>
>>>> sqltext: String =
>>>> CREATE TABLE IF NOT EXISTS test.etcs(
>>>>   ID INT
>>>> , CLUSTERED INT
>>>> , SCATTERED INT
>>>> , RANDOMISED INT
>>>> , RANDOM_STRING VARCHAR(50)
>>>> , SMALL_VC VARCHAR(10)
>>>> , PADDING VARCHAR(4000)
>>>> , PADDING2 STRING
>>>> )
>>>> CLUSTERED BY (ID) INTO 256 BUCKETS
>>>> STORED AS PARQUET
>>>> TBLPROPERTIES (
>>>>   "parquet.compress"="SNAPPY"
>>>> )
>>>>
>>>> scala> spark.sql(sqltext)
>>>> scala> spark.sql("show create table test.etcs").show(false)
>>>>
>>>> |createtab_stmt|
>>>> |CREATE TABLE `test`.`etcs` (
>>>>   `ID` INT,
>>>>   `CLUSTERED` INT,
>>>>   `SCATTERED` INT,
>>>>   `RANDOMISED` INT,
>>>>   `RANDOM_STRING` VARCHAR(50),
>>>>   `SMALL_VC` VARCHAR(10),
>>>>   `PADDING` VARCHAR(4000),
>>>>   `PADDING2` STRING)
>>>> USING parquet
>>>> CLUSTERED BY (ID)
>>>> INTO 256 BUCKETS
>>>> TBLPROPERTIES (
>>>>   'transient_lastDdlTime' = '1647033659',
>>>>   'parquet.compress' = 'SNAPPY')|
>>>> Note that the columns are OK.
>>>>
>>>> Also check this link for the differences between the CHAR, VARCHAR and
>>>> STRING types in Hive:
>>>>
>>>> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
>>>>
>>>> HTH
>>>>
>>>> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan <m.venkatbe...@gmail.com> wrote:
>>>>
>>>>> Thank you Mich Talebzadeh for your answer. It's good to know that
>>>>> VARCHAR and CHAR are shown properly in Spark 3. Do you know what
>>>>> changed in Spark 3 that made this possible? Or how can I achieve the
>>>>> same output in Spark 2.4.1? If there are some conf options, that would
>>>>> be helpful.
>>>>>
>>>>> Thanks,
>>>>> Venkat
>>>>> 2016173438
>>>>>
>>>>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> Hive 3.1.1
>>>>>> Spark 3.1.1
>>>>>>
>>>>>> Your Stack Overflow issue raised, and I quote:
>>>>>>
>>>>>> "I have a need to generate DDL statements for Hive tables & views
>>>>>> programmatically. I tried using Spark and Beeline for this task.
>>>>>> Beeline takes around 5-10 seconds for each of the statements whereas
>>>>>> Spark completes the same thing in a few milliseconds. I am planning to
>>>>>> use Spark since it is faster compared to Beeline.
>>>>>> One downside of using Spark for getting DDL statements from Hive is
>>>>>> that it treats CHAR and VARCHAR columns as String, and it doesn't
>>>>>> preserve the length information that goes with the CHAR and VARCHAR
>>>>>> data types. At the same time, Beeline preserves the data type and the
>>>>>> length information for CHAR and VARCHAR data types. *I am using Spark
>>>>>> 2.4.1 and Beeline 2.1.1.*
>>>>>>
>>>>>> Given below are the sample CREATE TABLE command and its SHOW CREATE
>>>>>> TABLE output."
>>>>>>
>>>>>> Create a simple table in *Hive* in the test database:
>>>>>>
>>>>>> hive> *use test;*
>>>>>> OK
>>>>>> hive> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);*
>>>>>> OK
>>>>>> hive> *desc formatted etc;*
>>>>>> # col_name        data_type       comment
>>>>>> id                bigint
>>>>>> col1              varchar(30)
>>>>>> col2              string
>>>>>>
>>>>>> # Detailed Table Information
>>>>>> Database:         test
>>>>>> OwnerType:        USER
>>>>>> Owner:            hduser
>>>>>> CreateTime:       Fri Mar 11 18:29:34 GMT 2022
>>>>>> LastAccessTime:   UNKNOWN
>>>>>> Retention:        0
>>>>>> Location:         hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>>>> Table Type:       MANAGED_TABLE
>>>>>> Table Parameters:
>>>>>>   COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>>>>   bucketing_version      2
>>>>>>   numFiles               0
>>>>>>   numRows                0
>>>>>>   rawDataSize            0
>>>>>>   totalSize              0
>>>>>>   transient_lastDdlTime  1647023374
>>>>>>
>>>>>> # Storage Information
>>>>>> SerDe Library:    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>>>> InputFormat:      org.apache.hadoop.mapred.TextInputFormat
>>>>>> OutputFormat:     org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>>>> Compressed:       No
>>>>>> Num Buckets:      -1
>>>>>> Bucket Columns:   []
>>>>>> Sort Columns:     []
>>>>>> Storage Desc Params:
>>>>>>   serialization.format  1
>>>>>>
>>>>>> Now let's go to spark-shell:
>>>>>>
>>>>>> scala> *spark.sql("show create table test.etc").show(false)*
>>>>>> |createtab_stmt|
>>>>>> |CREATE TABLE `test`.`etc` (
>>>>>>   `id` BIGINT,
>>>>>>   `col1` VARCHAR(30),
>>>>>>   `col2` STRING)
>>>>>> USING text
>>>>>> TBLPROPERTIES (
>>>>>>   'bucketing_version' = '2',
>>>>>>   'transient_lastDdlTime' = '1647023374')|
>>>>>>
>>>>>> You can see that Spark shows the columns correctly.
>>>>>>
>>>>>> Now let us go and create the same table in Hive through Beeline:
>>>>>>
>>>>>> 0: jdbc:hive2://rhes75:10099/default> *use test*
>>>>>> No rows affected (0.019 seconds)
>>>>>> 0: jdbc:hive2://rhes75:10099/default> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)*
>>>>>> No rows affected (0.304 seconds)
>>>>>> 0: jdbc:hive2://rhes75:10099/default> *desc formatted etc*
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>> | col_name                      | data_type                                          | comment    |
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>> | # col_name                    | data_type                                          | comment    |
>>>>>> | id                            | bigint                                             |            |
>>>>>> | col1                          | varchar(30)                                        |            |
>>>>>> | col2                          | string                                             |            |
>>>>>> |                               | NULL                                               | NULL       |
>>>>>> | # Detailed Table Information  | NULL                                               | NULL       |
>>>>>> | Database:                     | test                                               | NULL       |
>>>>>> | OwnerType:                    | USER                                               | NULL       |
>>>>>> | Owner:                        | hduser                                             | NULL       |
>>>>>> | CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                       | NULL       |
>>>>>> | LastAccessTime:               | UNKNOWN                                            | NULL       |
>>>>>> | Retention:                    | 0                                                  | NULL       |
>>>>>> | Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL       |
>>>>>> | Table Type:                   | MANAGED_TABLE                                      | NULL       |
>>>>>> | Table Parameters:             | NULL                                               | NULL       |
>>>>>> |                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
>>>>>> |                               | bucketing_version                                  | 2          |
>>>>>> |                               | numFiles                                           | 0          |
>>>>>> |                               | numRows                                            | 0          |
>>>>>> |                               | rawDataSize                                        | 0          |
>>>>>> |                               | totalSize                                          | 0          |
>>>>>> |                               | transient_lastDdlTime                              | 1647024660 |
>>>>>> |                               | NULL                                               | NULL       |
>>>>>> | # Storage Information         | NULL                                               | NULL       |
>>>>>> | SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL       |
>>>>>> | InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL       |
>>>>>> | OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL |
>>>>>> | Compressed:                   | No                                                 | NULL       |
>>>>>> | Num Buckets:                  | -1                                                 | NULL       |
>>>>>> | Bucket Columns:               | []                                                 | NULL       |
>>>>>> | Sort Columns:                 | []                                                 | NULL       |
>>>>>> | Storage Desc Params:          | NULL                                               | NULL       |
>>>>>> |                               | serialization.format                               | 1          |
>>>>>> +-------------------------------+----------------------------------------------------+------------+
>>>>>> 33 rows selected (0.159 seconds)
>>>>>>
>>>>>> Now check that in spark-shell again:
>>>>>>
>>>>>> scala> spark.sql("show create table test.etc").show(false)
>>>>>>
>>>>>> |createtab_stmt|
>>>>>> |CREATE TABLE `test`.`etc` (
>>>>>>   `id` BIGINT,
>>>>>>   `col1` VARCHAR(30),
>>>>>>   `col2` STRING)
>>>>>> USING text
>>>>>> TBLPROPERTIES (
>>>>>>   'bucketing_version' = '2',
>>>>>>   'transient_lastDdlTime' = '1647024660')|
>>>>>>
>>>>>> It shows OK.
>>>>>> So in summary, you get the column definitions in Spark as you have
>>>>>> defined them in Hive.
>>>>>>
>>>>>> Your statement above, and I quote, "I am using Spark 2.4.1 and
>>>>>> Beeline 2.1.1", refers to older versions of Spark and Hive which may
>>>>>> have had such issues.
>>>>>>
>>>>>> HTH
>>>>>>
>>>>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan <m.venkatbe...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Spark Team,
>>>>>>>
>>>>>>> I have raised a question on Spark through Stack Overflow. When you
>>>>>>> get a chance, can you please take a look and help me?
>>>>>>>
>>>>>>> https://stackoverflow.com/q/71431757/5927843
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Venkat
>>>>>>> 2016173438
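On the original Stack Overflow complaint that Beeline takes 5-10 seconds per statement: much of that cost is per-invocation startup, and Beeline can run a whole script file in one session via its `-f` option. A minimal sketch of batching every SHOW CREATE TABLE into a single script (the file name `ddl.hql` and the table list are illustrative only):

```python
def beeline_script(tables):
    """One SHOW CREATE TABLE per table, all in a single HiveQL script, so a
    single `beeline -u <jdbc-url> -f ddl.hql` run pays the startup cost
    once rather than once per statement."""
    return "".join(f"SHOW CREATE TABLE {t};\n" for t in tables)

# Example: the two tables discussed in this thread.
with open("ddl.hql", "w") as f:
    f.write(beeline_script(["test.etc", "test.etcs"]))
```

This keeps Beeline's faithful CHAR/VARCHAR output while amortising its startup overhead across all tables.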