Thank you. I am trying to get the table definition for existing tables. By the way, the CREATE and SHOW commands that you executed, were they on Spark 3.x?
Thanks,
Venkat
2016173438

On Fri, Mar 11, 2022 at 4:28 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> Well, I do not know what has changed. However, this should not affect
> your work.
>
> Try to create the table in Spark:
>
>     sqltext: String =
>     CREATE TABLE IF NOT EXISTS test.etcs(
>           ID INT
>         , CLUSTERED INT
>         , SCATTERED INT
>         , RANDOMISED INT
>         , RANDOM_STRING VARCHAR(50)
>         , SMALL_VC VARCHAR(10)
>         , PADDING VARCHAR(4000)
>         , PADDING2 STRING
>     )
>     CLUSTERED BY (ID) INTO 256 BUCKETS
>     STORED AS PARQUET
>     TBLPROPERTIES (
>       "parquet.compress"="SNAPPY"
>     )
>
>     scala> spark.sql(sqltext)
>     scala> spark.sql("show create table test.etcs").show(false)
>
>     createtab_stmt
>     --------------------------------------------------------
>     CREATE TABLE `test`.`etcs` (
>       `ID` INT,
>       `CLUSTERED` INT,
>       `SCATTERED` INT,
>       `RANDOMISED` INT,
>       `RANDOM_STRING` VARCHAR(50),
>       `SMALL_VC` VARCHAR(10),
>       `PADDING` VARCHAR(4000),
>       `PADDING2` STRING)
>     USING parquet
>     CLUSTERED BY (ID)
>     INTO 256 BUCKETS
>     TBLPROPERTIES (
>       'transient_lastDdlTime' = '1647033659',
>       'parquet.compress' = 'SNAPPY')
>
> Note that the columns are OK.
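[Editor's note: the `createtab_stmt` output above can be checked mechanically. The sketch below is an illustration added to this thread, not code from it; the function names and the regex are my own, and it assumes the backtick-quoted column style that Spark emits in SHOW CREATE TABLE.]

```python
import re

def parse_columns(createtab_stmt):
    """Map column name -> declared type from a SHOW CREATE TABLE statement."""
    # Take everything between the first '(' and the ')' that precedes
    # USING/STORED -- i.e. the column list.
    m = re.search(r"\((.*)\)\s*(?:USING|STORED)", createtab_stmt,
                  re.DOTALL | re.IGNORECASE)
    if m is None:
        raise ValueError("no column list found in statement")
    # Column entries look like:  `RANDOM_STRING` VARCHAR(50),
    pairs = re.findall(r"`(\w+)`\s+([A-Za-z]+(?:\(\d+(?:,\s*\d+)?\))?)",
                       m.group(1))
    return {name: dtype.upper() for name, dtype in pairs}

def suspect_columns(cols):
    """Columns typed plain STRING -- on Spark 2.4 these may be CHAR/VARCHAR
    columns whose length information was dropped, per the issue in this
    thread."""
    return [name for name, dtype in cols.items() if dtype == "STRING"]
```

Run against the `test.etcs` statement shown above, `parse_columns` reports `VARCHAR(50)` for `RANDOM_STRING`, confirming the length survived; columns flagged by `suspect_columns` may be genuine STRINGs or degraded CHAR/VARCHARs, which is exactly the ambiguity Spark 2.4 introduces.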
> Also check this link for the differences between the CHAR, VARCHAR and
> STRING types in Hive:
>
> https://cwiki.apache.org/confluence/display/hive/languagemanual+types
>
> HTH
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
> On Fri, 11 Mar 2022 at 20:55, Venkatesan Muniappan
> <m.venkatbe...@gmail.com> wrote:
>
>> Thank you Mich Talebzadeh for your answer. It is good to know that
>> VARCHAR and CHAR are shown correctly in Spark 3. Do you know what changed
>> in Spark 3 that made this possible? Or how can I achieve the same output
>> in Spark 2.4.1? If there are some conf options, that would be helpful.
>>
>> Thanks,
>> Venkat
>> 2016173438
>>
>> On Fri, Mar 11, 2022 at 2:06 PM Mich Talebzadeh
>> <mich.talebza...@gmail.com> wrote:
>>
>>> Hive 3.1.1
>>> Spark 3.1.1
>>>
>>> Your Stack Overflow issue raised, and I quote:
>>>
>>> "I have a need to generate DDL statements for Hive tables & views
>>> programmatically. I tried using Spark and Beeline for this task. Beeline
>>> takes around 5-10 seconds for each of the statements whereas Spark
>>> completes the same thing in a few milliseconds. I am planning to use
>>> Spark since it is faster compared to beeline. One downside of using
>>> spark for getting DDL statements from the hive is, it treats CHAR,
>>> VARCHAR characters as String and it doesn't preserve the length
>>> information that goes with CHAR, VARCHAR data types. At the same time
>>> beeline preserves the data type and the length information for CHAR,
>>> VARCHAR data types.
>>> I am using Spark 2.4.1 and Beeline 2.1.1.
>>>
>>> Given below the sample create table command and its show create table
>>> output."
>>>
>>> Create a simple table in Hive in the test database:
>>>
>>>     hive> use test;
>>>     OK
>>>     hive> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING);
>>>     OK
>>>     hive> desc formatted etc;
>>>     # col_name            data_type      comment
>>>     id                    bigint
>>>     col1                  varchar(30)
>>>     col2                  string
>>>
>>>     # Detailed Table Information
>>>     Database:             test
>>>     OwnerType:            USER
>>>     Owner:                hduser
>>>     CreateTime:           Fri Mar 11 18:29:34 GMT 2022
>>>     LastAccessTime:       UNKNOWN
>>>     Retention:            0
>>>     Location:             hdfs://rhes75:9000/user/hive/warehouse/test.db/etc
>>>     Table Type:           MANAGED_TABLE
>>>     Table Parameters:
>>>         COLUMN_STATS_ACCURATE  {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}}
>>>         bucketing_version      2
>>>         numFiles               0
>>>         numRows                0
>>>         rawDataSize            0
>>>         totalSize              0
>>>         transient_lastDdlTime  1647023374
>>>
>>>     # Storage Information
>>>     SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>>>     InputFormat:          org.apache.hadoop.mapred.TextInputFormat
>>>     OutputFormat:         org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>>>     Compressed:           No
>>>     Num Buckets:          -1
>>>     Bucket Columns:       []
>>>     Sort Columns:         []
>>>     Storage Desc Params:
>>>         serialization.format  1
>>>
>>> Now let's go to spark-shell:
>>>
>>>     scala> spark.sql("show create table test.etc").show(false)
>>>     createtab_stmt
>>>     --------------------------------------------------------
>>>     CREATE TABLE `test`.`etc` (
>>>       `id` BIGINT,
>>>       `col1` VARCHAR(30),
>>>       `col2` STRING)
>>>     USING text
>>>     TBLPROPERTIES (
>>>       'bucketing_version' = '2',
>>>       'transient_lastDdlTime' = '1647023374')
>>>
>>> You can see that Spark shows the columns correctly.
>>>
>>> Now let us go and create the same table in Hive through Beeline:
>>>
>>>     0: jdbc:hive2://rhes75:10099/default> use test
>>>     No rows affected (0.019 seconds)
>>>     0: jdbc:hive2://rhes75:10099/default> create table etc(ID BIGINT, col1 VARCHAR(30), col2 STRING)
>>>     No rows affected (0.304 seconds)
>>>     0: jdbc:hive2://rhes75:10099/default> desc formatted etc
>>>     | col_name                      | data_type                                                  | comment    |
>>>     | # col_name                    | data_type                                                  | comment    |
>>>     | id                            | bigint                                                     |            |
>>>     | col1                          | varchar(30)                                                |            |
>>>     | col2                          | string                                                     |            |
>>>     |                               | NULL                                                       | NULL       |
>>>     | # Detailed Table Information  | NULL                                                       | NULL       |
>>>     | Database:                     | test                                                       | NULL       |
>>>     | OwnerType:                    | USER                                                       | NULL       |
>>>     | Owner:                        | hduser                                                     | NULL       |
>>>     | CreateTime:                   | Fri Mar 11 18:51:00 GMT 2022                               | NULL       |
>>>     | LastAccessTime:               | UNKNOWN                                                    | NULL       |
>>>     | Retention:                    | 0                                                          | NULL       |
>>>     | Location:                     | hdfs://rhes75:9000/user/hive/warehouse/test.db/etc         | NULL       |
>>>     | Table Type:                   | MANAGED_TABLE                                              | NULL       |
>>>     | Table Parameters:             | NULL                                                       | NULL       |
>>>     |                               | COLUMN_STATS_ACCURATE                                      | {\"BASIC_STATS\":\"true\",\"COLUMN_STATS\":{\"col1\":\"true\",\"col2\":\"true\",\"id\":\"true\"}} |
>>>     |                               | bucketing_version                                          | 2          |
>>>     |                               | numFiles                                                   | 0          |
>>>     |                               | numRows                                                    | 0          |
>>>     |                               | rawDataSize                                                | 0          |
>>>     |                               | totalSize                                                  | 0          |
>>>     |                               | transient_lastDdlTime                                      | 1647024660 |
>>>     |                               | NULL                                                       | NULL       |
>>>     | # Storage Information         | NULL                                                       | NULL       |
>>>     | SerDe Library:                | org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe         | NULL       |
>>>     | InputFormat:                  | org.apache.hadoop.mapred.TextInputFormat                   | NULL       |
>>>     | OutputFormat:                 | org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat | NULL       |
>>>     | Compressed:                   | No                                                         | NULL       |
>>>     | Num Buckets:                  | -1                                                         | NULL       |
>>>     | Bucket Columns:               | []                                                         | NULL       |
>>>     | Sort Columns:                 | []                                                         | NULL       |
>>>     | Storage Desc Params:          | NULL                                                       | NULL       |
>>>     |                               | serialization.format                                       | 1          |
>>>     33 rows selected (0.159 seconds)
>>>
>>> Now check that in spark-shell again:
>>>
>>>     scala> spark.sql("show create table test.etc").show(false)
>>>     createtab_stmt
>>>     --------------------------------------------------------
>>>     CREATE TABLE `test`.`etc` (
>>>       `id` BIGINT,
>>>       `col1` VARCHAR(30),
>>>       `col2` STRING)
>>>     USING text
>>>     TBLPROPERTIES (
>>>       'bucketing_version' = '2',
>>>       'transient_lastDdlTime' = '1647024660')
>>>
>>> It shows OK. So, in summary, you get the column definitions in Spark as
>>> you have defined them in Hive.
>>>
>>> Your statement above, and I quote, "I am using Spark 2.4.1 and Beeline
>>> 2.1.1", refers to older versions of Spark and Hive which may have had
>>> such issues.
>>>
>>> HTH
>>>
>>> On Fri, 11 Mar 2022 at 18:19, Venkatesan Muniappan
>>> <m.venkatbe...@gmail.com> wrote:
>>>
>>>> Hi Spark Team,
>>>>
>>>> I have raised a question on Spark through Stack Overflow. When you get
>>>> a chance, can you please take a look and help me?
>>>>
>>>> https://stackoverflow.com/q/71431757/5927843
>>>>
>>>> Thanks,
>>>> Venkat
>>>> 2016173438
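[Editor's note: the Stack Overflow question quoted in this thread is about generating DDL for many tables programmatically. A minimal sketch of that loop is below; it is my own illustration, not code from the thread. The `run_sql` parameter is a stand-in the caller supplies (for example, a wrapper around `spark.sql(...).collect()`), so the function itself does not require a live SparkSession.]

```python
def collect_ddl(run_sql, database):
    """Return {table_name: CREATE TABLE statement} for every table in
    `database`.

    `run_sql` is any callable that takes a SQL string and returns a list
    of dict-like rows.  With a live SparkSession it could be, e.g.:
        run_sql = lambda q: [r.asDict() for r in spark.sql(q).collect()]
    """
    ddls = {}
    for row in run_sql("SHOW TABLES IN " + database):
        name = row["tableName"]
        result = run_sql("SHOW CREATE TABLE {}.{}".format(database, name))
        # Spark returns the statement in a single `createtab_stmt` column.
        ddls[name] = result[0]["createtab_stmt"]
    return ddls
```

Running this through Spark rather than Beeline matches the approach in the question: one in-process catalog call per table instead of a multi-second JDBC round trip per statement.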