[
https://issues.apache.org/jira/browse/SPARK-24261?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-24261:
---------------------------------
Labels: bulk-closed (was: )
> Spark cannot read renamed managed Hive table
> --------------------------------------------
>
> Key: SPARK-24261
> URL: https://issues.apache.org/jira/browse/SPARK-24261
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Reporter: Suraj Nayak
> Priority: Major
> Labels: bulk-closed
> Attachments: some_db.some_new_table.ddl,
> some_db.some_new_table_buggy_path.ddl, some_db.some_table.ddl
>
>
> When Spark creates a Hive table via {{df.write.saveAsTable}}, it creates a
> managed table in Hive with SERDEPROPERTIES such as
> {{WITH SERDEPROPERTIES
> ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_table') }}
> When an external user renames the table via the Hive CLI or Hue, Hive
> updates both the table name and the table location, but it never updates
> the {{path}} serde property shown above.
> *Steps to Reproduce:*
> 1. Save a table using Spark:
> {{spark.sql("select * from
> some_db.some_table").write.saveAsTable("some_db.some_new_table")}}
> 2. In the Hive CLI or Hue, rename the table:
> {{alter table some_db.some_new_table rename to
> some_db.some_new_table_buggy_path}}
> 3. Try to read the buggy table *some_db.some_new_table_buggy_path* in Spark:
> {{spark.sql("select * from some_db.some_new_table_buggy_path limit
> 10").collect}}
> Spark logs the following warnings and returns an empty result (Spark fails
> to read the table while Hive can still read it):
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible
> stale CacheEntry; failed to fetch item info for:
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/ - removing from
> cache}}
> {{18/05/13 17:45:16 WARN gcsio.CacheSupplementedGoogleCloudStorage: Possible
> stale CacheEntry; failed to fetch item info for:
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table/_SUCCESS - removing
> from cache}}
> {{18/05/13 17:45:16 WARN datasources.InMemoryFileIndex: The directory
> gs://some-gs-bucket/warehouse/hive/some.db/some_new_table was not found. Was
> it deleted very recently?}}
> {{res2: Array[org.apache.spark.sql.Row] = Array()}}
> The DDLs for each of the tables are attached.
> This creates an inconsistency, and end users will spend endless time hunting
> for the bug if data exists in both locations, because Spark reads from the
> stale location while the Hive process writes new data to the new one.
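> A possible manual workaround (my assumption, not a verified fix) is to point
> the serde property back at the post-rename location and refresh Spark's
> cached metadata:
> {{spark.sql("alter table some_db.some_new_table_buggy_path set
> serdeproperties
> ('path'='gs://some-gs-bucket/warehouse/hive/some.db/some_new_table_buggy_path')")}}
> {{spark.catalog.refreshTable("some_db.some_new_table_buggy_path")}}
> This only masks the underlying bug: any later rename reintroduces the stale
> path.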
> I went through similar JIRAs, but they address different issues:
> SPARK-15635 and SPARK-16570 cover ALTER TABLE performed within Spark itself,
> whereas this JIRA concerns an external process renaming the table.
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)