[
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-24261:
---------------------------------
Labels: bulk-closed (was: )
> Spark cannot read renamed managed Hive table
> --------------------------------------------
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Suraj Nayak
> Priority: Major
> Labels: bulk-closed
> Attachments: some_db.some_new_table.ddl,
> some_db.some_new_table_buggy_path.ddl, some_db.some_table.ddl
>
>
> When Spark creates a Hive table via {{df.write.saveAsTable}}, it creates a
> managed table in Hive with SERDEPROPERTIES such as
> {{WITH SERDEPROPERTIES
> ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}
> When an external user renames the table via the Hive CLI or Hue, Hive
> updates both the table name and the table location, but it never updates
> the {{path}} serde property shown above.
> *Steps to Reproduce:*
> 1. Save a table using Spark:
> {{spark.sql("select * from
> some_db.some_table").write.saveAsTable("some_db.some_new_table")}}
> 2. In the Hive CLI or Hue, rename the table:
> {{alter table some_db.some_new_table rename to
> some_db.some_new_table_buggy_path}}
> 3. Try to read the buggy table *some_db.some_new_table_buggy_path* in Spark:
> {{spark.sql("select * from some_db.some_new_table_buggy_path limit
> 10").collect}}
> Spark logs the following warnings and returns an empty result (Spark fails
> to read the table while Hive can still read it):
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible
> stale CacheEntry; failed to fetch item info for:
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from
> cache}}
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible
> stale CacheEntry; failed to fetch item info for:
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing
> from cache}}
> {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was
> it deleted very recently?}}
> {{res2: Array[org.apache.spark.sql.Row] = Array()}}
> The DDLs for each of the tables are attached.
> This creates an inconsistency, and end users will spend endless time hunting
> for the bug if data exists in both locations, because Spark reads from the
> stale location while the Hive process writes new data to the new one.
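> A possible manual workaround (my assumption, not a verified fix) is to point
> the serde property back at the post-rename location and refresh Spark's
> cached metadata:
> {{spark.sql("alter table some_db.some_new_table_buggy_path set
> serdeproperties
> ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table_buggy_path')")}}
> {{spark.catalog.refreshTable("some_db.some_new_table_buggy_path")}}
> This only masks the underlying bug: any later rename reintroduces the stale
> path.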
> I went through similar JIRAs, but they address different issues:
> SPARK-15635 and SPARK-16570 cover ALTER TABLE performed within Spark itself,
> whereas this JIRA concerns an external process renaming the table.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)