[ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Saif Addin updated SPARK-21198:
-------------------------------
    Description: 
We have a considerably large Hive metastore and a Spark program that goes through Hive data availability.

In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and sqlContext.isCached() to go through Hive metastore information.

Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but it turns out that both listDatabases() and listTables() take between 5 and 20 minutes, depending on the database, to return results, using operations such as the following one:

spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect

This makes the program unbearably slow just to return a list of tables.

I know we still have spark.sqlContext.tableNames as a workaround, but I am assuming this is going to be deprecated anytime soon?

  was:
We have a considerably large Hive metastore and a Spark program that goes through Hive data availability.

In Spark 1.x, we were using sqlContext.tableNames or sqlContext.sql() to go through Hive.

Once migrated to Spark 2.x we switched over to SparkSession.catalog instead, but it turns out that both listDatabases() and listTables() take between 5 and 20 minutes, depending on the database, to return results, using operations such as the following one:

spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect

This makes the program unbearably slow just to return a list of tables.

I know we still have spark.sqlContext.tableNames as a workaround, but I am assuming this is going to be deprecated anytime soon?


> SparkSession catalog is terribly slow
> -------------------------------------
>
>                 Key: SPARK-21198
>                 URL: https://issues.apache.org/jira/browse/SPARK-21198
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x we switched over to SparkSession.catalog instead,
> but it turns out that both listDatabases() and listTables() take between 5
> and 20 minutes, depending on the database, to return results, using
> operations such as the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> This makes the program unbearably slow just to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am
> assuming this is going to be deprecated anytime soon?

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
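[Editor's note: the comparison described in the report can be sketched as a small Scala program. This is a hedged illustration, not the reporter's code: the database name "default" and the timing scaffolding are made up for the example, and a Hive-enabled SparkSession against a real metastore is assumed, so it is not runnable standalone.]

```scala
import org.apache.spark.sql.SparkSession

object ListTablesTiming {
  def main(args: Array[String]): Unit = {
    // Assumes Hive support is available on the classpath and a metastore
    // is configured; otherwise enableHiveSupport() will fail.
    val spark = SparkSession.builder()
      .appName("list-tables-timing")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._  // encoder needed for .map(_.name) below

    val db = "default"  // hypothetical database name

    // Spark 2.x catalog API: returns a Dataset[Table], where each Table
    // object is fully populated from the metastore (the reported slow path).
    val t0 = System.nanoTime()
    val viaCatalog = spark.catalog.listTables(db)
      .filter(_.isTemporary)
      .map(_.name)
      .collect()
    println(f"catalog.listTables: ${(System.nanoTime() - t0) / 1e9}%.1fs, ${viaCatalog.length} names")

    // 1.x-era workaround mentioned in the report: returns bare table names
    // only, without per-table metadata.
    val t1 = System.nanoTime()
    val viaTableNames = spark.sqlContext.tableNames(db)
    println(f"sqlContext.tableNames: ${(System.nanoTime() - t1) / 1e9}%.1fs, ${viaTableNames.length} names")

    spark.stop()
  }
}
```

Note that the two calls are not exact equivalents: tableNames returns every table name in the database, while the catalog snippet above additionally filters to temporary tables, so the outputs can differ even when both succeed.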