Spark can't see Hive schema updates partly because it stores its own copy of
the table schema, serialized as JSON, in the Hive metastore's table properties.


1. FROM SPARK: create a table
============
>>> spark.sql("select 1 col1, 2 col2").write.format("parquet").saveAsTable("my_table")
>>> spark.table("my_table").printSchema()
root
 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)


2. FROM HIVE: alter the schema
==========
0: jdbc:hive2://localhost:10000> ALTER TABLE my_table REPLACE COLUMNS(`col1` int, `col2` int, `col3` string);
0: jdbc:hive2://localhost:10000> describe my_table;
+-----------+------------+----------+
| col_name  | data_type  | comment  |
+-----------+------------+----------+
| col1      | int        |          |
| col2      | int        |          |
| col3      | string     |          |
+-----------+------------+----------+


3. FROM SPARK: problem, the new column does not appear
==============
>>> spark.table("my_table").printSchema()
root
 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)
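
The two-column result above matches what the metastore stores under
spark.sql.sources.schema.part.0: Spark rebuilds the table schema from that
JSON blob rather than from COLUMNS_V2. A quick sketch with plain Python
(the JSON string is copied verbatim from the TABLE_PARAMS dump in step 4)
shows exactly what Spark sees:

```python
import json

# Value of spark.sql.sources.schema.part.0, copied from TABLE_PARAMS (step 4)
schema_json = ('{"type":"struct","fields":['
               '{"name":"col1","type":"integer","nullable":true,"metadata":{}},'
               '{"name":"col2","type":"integer","nullable":true,"metadata":{}}]}')

schema = json.loads(schema_json)
# Only col1 and col2 are present; Hive's col3 never made it into this blob
print([f["name"] for f in schema["fields"]])
```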


4. FROM METASTORE DB: two ways of storing the columns
======================
metastore=# select * from "COLUMNS_V2";
 CD_ID | COMMENT | COLUMN_NAME | TYPE_NAME | INTEGER_IDX
-------+---------+-------------+-----------+-------------
     2 |         | col1        | int       |           0
     2 |         | col2        | int       |           1
     2 |         | col3        | string    |           2


metastore=# select * from "TABLE_PARAMS";
 TBL_ID |             PARAM_KEY             | PARAM_VALUE
--------+-----------------------------------+-------------
      1 | spark.sql.sources.provider        | parquet
      1 | spark.sql.sources.schema.part.0   | {"type":"struct","fields":[{"name":"col1","type":"integer","nullable":true,"metadata":{}},{"name":"col2","type":"integer","nullable":true,"metadata":{}}]}
      1 | spark.sql.create.version          | 2.4.8
      1 | spark.sql.sources.schema.numParts | 1
      1 | last_modified_time                | 1641483180
      1 | transient_lastDdlTime             | 1641483180
      1 | last_modified_by                  | anonymous

metastore=# truncate "TABLE_PARAMS";
TRUNCATE TABLE
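
Truncating TABLE_PARAMS works, but it also wipes spark.sql.sources.provider
and every other property. A less destructive option (a sketch, not something
I've battle-tested) is to rewrite only the schema property so it includes
col3, e.g. by building the new JSON value in Python and then UPDATE-ing that
single row:

```python
import json

# Current value of spark.sql.sources.schema.part.0 (two columns)
old = json.loads('{"type":"struct","fields":['
                 '{"name":"col1","type":"integer","nullable":true,"metadata":{}},'
                 '{"name":"col2","type":"integer","nullable":true,"metadata":{}}]}')

# Append the column Hive added, mirroring Spark's per-field encoding
old["fields"].append(
    {"name": "col3", "type": "string", "nullable": True, "metadata": {}})

new_value = json.dumps(old, separators=(",", ":"))
# Then, in the metastore DB, something like:
#   UPDATE "TABLE_PARAMS" SET "PARAM_VALUE" = <new_value>
#   WHERE "PARAM_KEY" = 'spark.sql.sources.schema.part.0';
print(new_value)
```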


5. FROM SPARK: now the column magically appears
==============
>>> spark.table("my_table").printSchema()
root
 |-- col1: integer (nullable = true)
 |-- col2: integer (nullable = true)
 |-- col3: string (nullable = true)


So is it really necessary to store that schema copy in TABLE_PARAMS?

