[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16061899#comment-16061899
 ] 

Liang-Chi Hsieh commented on SPARK-21198:
-----------------------------------------

{{CatalogImpl.listTables}} returns the full metadata for every table in the 
specified database, so it fetches the metadata for each table individually. For 
a large database with many tables, this can take a long time.

{{SQLContext.tableNames}} bypasses this interface and calls 
{{SessionCatalog.listTables}}, which retrieves only the table names.
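To illustrate the difference, here is a minimal sketch (assuming a live {{SparkSession}} named {{spark}} and a database named {{mydb}}, both hypothetical):

```scala
// Fast path: returns only the table names, with no per-table
// metadata lookup against the metastore.
val names: Array[String] = spark.sqlContext.tableNames("mydb")

// Slow path: Catalog.listTables builds a full Table object
// (description, type, isTemporary, ...) for every table, which
// requires one metadata round trip per table.
val tables = spark.catalog.listTables("mydb").collect()
```

On a metastore with thousands of tables, the first call is a single listing, while the second scales with the number of tables.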

In Spark 2.1, {{SessionCatalog}} is internal and cannot be accessed through 
{{SparkSession}}. It has been opened up since 2.2, but only to ease debugging; 
it is still not a user-facing API.

Personally, I think there should be an API in the user-facing {{CatalogImpl}} 
that returns just the table names. Users who want more metadata can then call 
{{CatalogImpl.getTable}} with a table name.
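A sketch of how that two-step pattern would look from user code (the name-only listing is the proposed, not-yet-existing API; {{spark}} and {{mydb}} are hypothetical):

```scala
// Proposed: a cheap, name-only listing (this method does not exist yet;
// today only the metadata-heavy listTables is available).
// val names: Seq[String] = spark.catalog.listTableNames("mydb")

// Then fetch metadata only for the tables actually of interest,
// using the existing Catalog.getTable(dbName, tableName):
// val meta = names.filter(_.startsWith("fact_"))
//                 .map(n => spark.catalog.getTable("mydb", n))
```

This keeps the common case (enumerating names) cheap while still allowing targeted metadata lookups.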


> SparkSession catalog is terribly slow
> -------------------------------------
>
>                 Key: SPARK-21198
>                 URL: https://issues.apache.org/jira/browse/SPARK-21198
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql(), and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x, we switched over to SparkSession.catalog instead, 
> but it turns out that both listDatabases() and listTables() take between 5 
> and 20 minutes, depending on the database, to return results, using 
> operations such as the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> This made the program unbearably slow at returning a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am 
> assuming this is going to be deprecated anytime soon?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
