OK. I work for an org where such upgrades take a few months, so this is
not an immediate task.

Thanks,
Venkat
2016173438


On Fri, Mar 11, 2022 at 4:38 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Yes, that was on Spark 3.1.1. Best to upgrade to Spark 3+.
>
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Fri, 11 Mar 2022 at 21:35, Venkatesan Muniappan <
> m.venkatbe...@gmail.com> wrote:
>
>> Thank you. I am trying to get the table definitions for existing
>> tables. BTW, the create and show commands that you executed, were they on
>> Spark 3.x?
>>
>> Thanks,
>> Venkat
>> 2016173438
>>
>>
>> On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> Well, I do not know what has changed. However, this should not affect
>>> your work.
>>>
>>>
>>> Try creating the table in Spark:
>>>
>>>
>>> sqltext: String =
>>>   CREATE TABLE if not exists test.etcs(
>>>      ID INT
>>>    , CLUSTERED INT
>>>    , SCATTERED INT
>>>    , RANDOMISED INT
>>>    , RANDOM_STRING VARCHAR(50)
>>>    , SMALL_VC VARCHAR(10)
>>>    , PADDING  VARCHAR(4000)
>>>    , PADDING2 STRING
>>>   )
>>>   CLUSTERED BY (ID) INTO 256 BUCKETS
>>>   STORED AS PARQUET
>>>   TBLPROPERTIES (
>>>   "parquet.compress"="SNAPPY"
>>>  )
>>>
>>> scala> spark.sql(sqltext)
>>> scala> spark.sql("show create table test.etcs").show(false)
>>>
>>>
>>> createtab_stmt:
>>>
>>> CREATE TABLE `test`.`etcs` (
>>>   `ID` INT,
>>>   `CLUSTERED` INT,
>>>   `SCATTERED` INT,
>>>   `RANDOMISED` INT,
>>>   `RANDOM_STRING` VARCHAR(50),
>>>   `SMALL_VC` VARCHAR(10),
>>>   `PADDING` VARCHAR(4000),
>>>   `PADDING2` STRING)
>>> USING parquet
>>> CLUSTERED BY (ID)
>>> INTO 256 BUCKETS
>>> TBLPROPERTIES (
>>>   'transient_lastDdlTime' = '1647033659',
>>>   'parquet.compress' = 'SNAPPY')
>>>
>>>
>>> Note that the column definitions, including the VARCHAR lengths, come
>>> back intact.
>>>
>>>
>>> Also check this link for the differences between the CHAR, VARCHAR and
>>> STRING types in Hive:
>>>
>>>
>>> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
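>>>
>>> As a quick, hypothetical illustration of the difference (this is standard
>>> Hive behaviour as I understand it, though exact truncation rules can vary
>>> with Hive version and configuration; the table name type_demo is made up
>>> for this sketch):
>>>
>>> hive> CREATE TABLE test.type_demo(c CHAR(5), v VARCHAR(5), s STRING);
>>> hive> INSERT INTO test.type_demo VALUES ('abcdefgh', 'abcdefgh', 'abcdefgh');
>>> hive> SELECT c, v, s FROM test.type_demo;
>>> -- c and v are cut to their declared maximum of 5 characters ('abcde');
>>> -- c is a fixed-length, blank-padded type, v keeps at most 5 characters,
>>> -- and s keeps all 8 characters because STRING is unbounded.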
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>>   view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
>>> arise from relying on this email's technical content is explicitly
>>> disclaimed. The author will in no case be liable for any monetary damages
>>> arising from such loss, damage or destruction.
>>>
>>>
>>>
>>>
>>> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan <
>>> m.venkatbe...@gmail.com> wrote:
>>>
>>>> Thank you, Mich Talebzadeh, for your answer. It's good to know that
>>>> VARCHAR and CHAR are shown properly in Spark 3. Do you know what changed
>>>> in Spark 3 that made this possible? Or how can I achieve the same output
>>>> in Spark 2.4.1? If there are some conf options, that would be helpful.
>>>>
>>>> Thanks,
>>>> Venkat
>>>> 2016173438
>>>>
>>>>
>>>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh <
>>>> mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Hive 3.1.1
>>>>> Spark 3.1.1
>>>>>
>>>>> Your Stack Overflow issue, and I quote:
>>>>>
>>>>> "I have a need to generate DDL statements for Hive tables & views
>>>>> programmatically. I tried using Spark and Beeline for this task. Beeline
>>>>> takes around 5-10 seconds for each of the statements whereas Spark
>>>>> completes the same thing in a few milliseconds. I am planning to use
>>>>> Spark since it is faster compared to Beeline. One downside of using
>>>>> Spark for getting DDL statements from Hive is that it treats CHAR and
>>>>> VARCHAR columns as String and does not preserve the length information
>>>>> that goes with the CHAR and VARCHAR data types, while Beeline preserves
>>>>> both the data type and the length information. *I am using Spark 2.4.1
>>>>> and Beeline 2.1.1.*
>>>>>
>>>>> Given below are the sample create table command and its show create
>>>>> table output."
>>>>>
>>>>> Create a simple table in *Hive*, in the test database:
>>>>>
>>>>> hive> *use test;*
>>>>> OK
>>>>> hive> *create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);*
>>>>> OK
>>>>> hive> *desc formatted etc;*
>>>>> # col_name              data_type               comment
>>>>> *id                      bigint*
>>>>> *col1                    varchar(30)*
>>>>> *col2                    string*
>>>>>
>>>>> # Detailed Table Information
>>>>> Database:               test
>>>>> OwnerType:              USER
>>>>> Owner:                  hduser
>>>>> CreateTime:             Fri Mar 11 18:29:34 GMT 2022
>>>>> LastAccessTime:         UNKNOWN
>>>>> Retention:              0
>>>>> Location:               hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>>> Table Type:             MANAGED_TABLE
>>>>> Table Parameters:
>>>>>         COLUMN_STATS_ACCURATE   {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>>>         bucketing_version       2
>>>>>         numFiles                0
>>>>>         numRows                 0
>>>>>         rawDataSize             0
>>>>>         totalSize               0
>>>>>         transient_lastDdlTime   1647023374
>>>>>
>>>>> # Storage Information
>>>>> SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>>> InputFormat:            org.apache.hadoop.mapred.TextInputFormat
>>>>> OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>>> Compressed:             No
>>>>> Num Buckets:            -1
>>>>> Bucket Columns:         []
>>>>> Sort Columns:           []
>>>>> Storage Desc Params:
>>>>>         serialization.format    1
>>>>>
>>>>> Now let's go to spark-shell:
>>>>> scala> spark.sql("show create table test.etc").show(false)
>>>>>
>>>>> createtab_stmt:
>>>>>
>>>>> CREATE TABLE `test`.`etc` (
>>>>>   `id` BIGINT,
>>>>>   `col1` VARCHAR(30),
>>>>>   `col2` STRING)
>>>>> USING text
>>>>> TBLPROPERTIES (
>>>>>   'bucketing_version' = '2',
>>>>>   'transient_lastDdlTime' = '1647023374')
>>>>>
>>>>> You can see that Spark shows the columns correctly.
>>>>>
>>>>> Now let us create the same table in Hive through beeline:
>>>>>
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default> use test
>>>>> No rows affected (0.019 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
>>>>> No rows affected (0.304 seconds)
>>>>>
>>>>> 0: jdbc:hive2://rhes75:10099/default> desc formatted etc
>>>>> +-------------------------------+----------------------------------------------------+----------------------+
>>>>> |           col_name            |                     data_type                      |       comment        |
>>>>> +-------------------------------+----------------------------------------------------+----------------------+
>>>>> | # col_name                    | data_type                                          | comment              |
>>>>> | id                            | bigint                                             |                      |
>>>>> | col1                          | varchar(30)                                        |                      |
>>>>> | col2                          | string                                             |                      |
>>>>> |                               | NULL                                               | NULL                 |
>>>>> | # Detailed Table Information  | NULL                                               | NULL                 |
>>>>> | Database:                     | test                                               | NULL                 |
>>>>> | OwnerType:                    | USER                                               | NULL                 |
>>>>> | Owner:                        | hduser                                             | NULL                 |
>>>>> | CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                       | NULL                 |
>>>>> | LastAccessTime:               | UNKNOWN                                            | NULL                 |
>>>>> | Retention:                    | 0                                                  | NULL                 |
>>>>> | Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc | NULL                 |
>>>>> | Table Type:                   | MANAGED_TABLE                                      | NULL                 |
>>>>> | Table Parameters:             | NULL                                               | NULL                 |
>>>>> |                               | COLUMN_STATS_ACCURATE                              | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
>>>>> |                               | bucketing_version                                  | 2                    |
>>>>> |                               | numFiles                                           | 0                    |
>>>>> |                               | numRows                                            | 0                    |
>>>>> |                               | rawDataSize                                        | 0                    |
>>>>> |                               | totalSize                                          | 0                    |
>>>>> |                               | transient_lastDdlTime                              | 1647024660           |
>>>>> |                               | NULL                                               | NULL                 |
>>>>> | # Storage Information         | NULL                                               | NULL                 |
>>>>> | SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | NULL                 |
>>>>> | InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat           | NULL                 |
>>>>> | OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL         |
>>>>> | Compressed:                   | No                                                 | NULL                 |
>>>>> | Num Buckets:                  | -1                                                 | NULL                 |
>>>>> | Bucket Columns:               | []                                                 | NULL                 |
>>>>> | Sort Columns:                 | []                                                 | NULL                 |
>>>>> | Storage Desc Params:          | NULL                                               | NULL                 |
>>>>> |                               | serialization.format                               | 1                    |
>>>>> +-------------------------------+----------------------------------------------------+----------------------+
>>>>>
>>>>> 33 rows selected (0.159 seconds)
>>>>>
>>>>> Now check that in spark-shell again:
>>>>>
>>>>>
>>>>> scala> spark.sql("show create table test.etc").show(false)
>>>>>
>>>>> createtab_stmt:
>>>>>
>>>>> CREATE TABLE `test`.`etc` (
>>>>>   `id` BIGINT,
>>>>>   `col1` VARCHAR(30),
>>>>>   `col2` STRING)
>>>>> USING text
>>>>> TBLPROPERTIES (
>>>>>   'bucketing_version' = '2',
>>>>>   'transient_lastDdlTime' = '1647024660')
>>>>>
>>>>>
>>>>> It shows OK. So, in summary, you get the column definitions in Spark
>>>>> exactly as you defined them in Hive.
>>>>>
>>>>>
>>>>> Your statement above, and I quote, "I am using Spark 2.4.1 and Beeline
>>>>> 2.1.1", refers to older versions of Spark and Hive which may have had
>>>>> such issues.
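>>>>>
>>>>> If upgrading is not an immediate option, one possible workaround for
>>>>> Spark 2.4.x (a sketch only, untested here, and it goes through the
>>>>> unstable spark.sessionState API): in the 2.x code base, when a Hive
>>>>> column type such as VARCHAR(30) does not survive the mapping to a Spark
>>>>> type, the raw Hive type string is kept in the column metadata under the
>>>>> key HIVE_TYPE_STRING, so the original types can be recovered from the
>>>>> catalog:
>>>>>
>>>>> scala> import org.apache.spark.sql.catalyst.TableIdentifier
>>>>> scala> val tbl = spark.sessionState.catalog
>>>>>      |   .getTableMetadata(TableIdentifier("etc", Some("test")))
>>>>> scala> val cols = tbl.schema.fields.map { f =>
>>>>>      |   // fall back to the Spark type when no raw Hive type was recorded
>>>>>      |   val hiveType =
>>>>>      |     if (f.metadata.contains("HIVE_TYPE_STRING"))
>>>>>      |       f.metadata.getString("HIVE_TYPE_STRING")
>>>>>      |     else f.dataType.catalogString
>>>>>      |   s"  `${f.name}` $hiveType"
>>>>>      | }
>>>>> scala> println(cols.mkString("CREATE TABLE `test`.`etc` (\n", ",\n", ")"))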
>>>>>
>>>>>
>>>>> HTH
>>>>>
>>>>>
>>>>>
>>>>>    view my Linkedin profile
>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>
>>>>>
>>>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>>
>>>>>
>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>> any loss, damage or destruction of data or any other property which may
>>>>> arise from relying on this email's technical content is explicitly
>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>> arising from such loss, damage or destruction.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan <
>>>>> m.venkatbe...@gmail.com> wrote:
>>>>>
>>>>>> hi Spark Team,
>>>>>>
>>>>>> I have raised a question about Spark on Stack Overflow. When you get
>>>>>> a chance, can you please take a look and help me?
>>>>>>
>>>>>> https://stackoverflow.com/q/71431757/5927843
>>>>>>
>>>>>> Thanks,
>>>>>> Venkat
>>>>>> 2016173438
>>>>>>
>>>>>
