[ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saif Addin updated SPARK-21198:
-------------------------------
    Description: 
We have a considerably large Hive metastore and a Spark program that goes through Hive data availability.

In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and sqlContext.isCached() to go through Hive metastore information.

Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but it turns out that both listDatabases() and listTables() take between 5 and 20 minutes, depending on the database, to return results, using operations such as the following one:

spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect

This makes the program unbearably slow just to return a list of tables.

I know we still have spark.sqlContext.tableNames as a workaround, but I am assuming this is going to be deprecated anytime soon?

  was:
We have a considerably large Hive metastore and a Spark program that goes through Hive data availability.

In Spark 1.x, we were using sqlContext.tableNames or sqlContext.sql() to go through Hive.

Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but it turns out that both listDatabases() and listTables() take between 5 and 20 minutes, depending on the database, to return results, using operations such as the following one:

spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect

This makes the program unbearably slow just to return a list of tables.

I know we still have spark.sqlContext.tableNames as a workaround, but I am assuming this is going to be deprecated anytime soon?


> SparkSession catalog is terribly slow
> -------------------------------------
>
>                 Key: SPARK-21198
>                 URL: https://issues.apache.org/jira/browse/SPARK-21198
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead,
> but it turns out that both listDatabases() and listTables() take between 5
> and 20 minutes, depending on the database, to return results, using
> operations such as the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> This makes the program unbearably slow just to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am
> assuming this is going to be deprecated anytime soon?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
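[Editor's note: the comparison described in the report can be sketched as a small Scala program. This is a hedged illustration, not the reporter's code: the database name "default" and the timing scaffolding are made up for the example, and a Hive-enabled SparkSession against a real metastore is assumed, so it is not runnable standalone.]

```scala
import org.apache.spark.sql.SparkSession

object ListTablesTiming {
  def main(args: Array[String]): Unit = {
    // Assumes Hive support is available on the classpath and a metastore
    // is configured; otherwise enableHiveSupport() will fail.
    val spark = SparkSession.builder()
      .appName("list-tables-timing")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._  // encoder needed for .map(_.name) below

    val db = "default"  // hypothetical database name

    // Spark 2.x catalog API: returns a Dataset[Table], where each Table
    // object is fully populated from the metastore (the reported slow path).
    val t0 = System.nanoTime()
    val viaCatalog = spark.catalog.listTables(db)
      .filter(_.isTemporary)
      .map(_.name)
      .collect()
    println(f"catalog.listTables: ${(System.nanoTime() - t0) / 1e9}%.1fs, ${viaCatalog.length} names")

    // 1.x-era workaround mentioned in the report: returns bare table names
    // only, without per-table metadata.
    val t1 = System.nanoTime()
    val viaTableNames = spark.sqlContext.tableNames(db)
    println(f"sqlContext.tableNames: ${(System.nanoTime() - t1) / 1e9}%.1fs, ${viaTableNames.length} names")

    spark.stop()
  }
}
```

Note that the two calls are not exact equivalents: tableNames returns every table name in the database, while the catalog snippet above additionally filters to temporary tables, so the outputs can differ even when both succeed.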