[ 
https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16062070#comment-16062070
 ] 

Saif Addin edited comment on SPARK-21198 at 6/24/17 5:26 PM:
-------------------------------------------------------------

Thanks [~viirya]
My program shows, on a web page, the tables available in whichever database is 
selected from a dropdown list, along with each table's schema and metadata. At 
the beginning of my program, I go through all tables to extract the necessary 
information, so I have to visit every table at least once. When I migrated 
over to the catalog, I thought my program had gotten stuck, but it was just 
taking too long (20 to 30 minutes).

Each time people click the dropdown, I re-request the table list to ensure I 
keep the list up-to-date.
As the database list and each table schema take too long to request 
dynamically, I store them in a cache as people use them. But I would love for 
this process to take less time (Schema, isCached, isTemporary).
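
A lazily filled cache of the kind described above could be sketched roughly as 
follows. This is only an illustration: `SchemaCache`, its `load` function and 
the demo values are hypothetical names, not taken from the reporter's program or 
from Spark, and the slow catalog lookup is replaced by a stand-in closure.

```scala
import scala.collection.concurrent.TrieMap

// Hypothetical cache: names and structure are illustrative, not part of Spark.
final class SchemaCache[K, V](load: K => V) {
  private val entries = TrieMap.empty[K, V]

  // Return the cached value, running the slow lookup on first access.
  // Concurrent first accesses may each run the lookup, but only one result is kept.
  def get(key: K): V = entries.getOrElseUpdate(key, load(key))

  // Drop one entry so that the next get() reloads it.
  def invalidate(key: K): Unit = {
    entries.remove(key)
    ()
  }

  def size: Int = entries.size
}

object SchemaCacheDemo {
  def main(args: Array[String]): Unit = {
    var lookups = 0
    // Stand-in for a slow metastore call.
    val cache = new SchemaCache[String, String]({ table =>
      lookups += 1
      s"schema-of-$table"
    })
    cache.get("sales.orders")
    cache.get("sales.orders") // served from the cache, no second lookup
    println(s"lookups = $lookups")
  }
}
```

Calling `invalidate` for a table when the dropdown is refreshed would force only 
that entry to be reloaded, rather than the whole cache.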

If you are open to other suggestions: since TempViews always appear in the list 
of tables, I have to apply some manual logic to separate TempViews from the 
requested table list.
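
The manual separation mentioned above might look something like the sketch 
below. `TableInfo` and `splitTempViews` are hypothetical names used only for 
illustration; `TableInfo` stands in for the `name` and `isTemporary` fields of 
Spark's `org.apache.spark.sql.catalog.Table`, so the example runs without a 
Spark session. With a live session, the input could presumably be built from 
`spark.catalog.listTables(db).collect()`.

```scala
// Stand-in for the `name` and `isTemporary` fields of Spark's
// org.apache.spark.sql.catalog.Table, so the sketch runs without a cluster.
final case class TableInfo(name: String, isTemporary: Boolean)

object TableListing {
  // Split one listing into (temporary view names, persistent table names),
  // preserving the original order within each group.
  def splitTempViews(tables: Seq[TableInfo]): (Seq[String], Seq[String]) = {
    val (temp, persistent) = tables.partition(_.isTemporary)
    (temp.map(_.name), persistent.map(_.name))
  }

  def main(args: Array[String]): Unit = {
    // Made-up listing, for illustration only.
    val listing = Seq(
      TableInfo("orders", isTemporary = false),
      TableInfo("scratch_view", isTemporary = true),
      TableInfo("customers", isTemporary = false)
    )
    val (tempViews, tables) = splitTempViews(listing)
    println(s"temp views: $tempViews")
    println(s"tables:     $tables")
  }
}
```

Doing the split on an already collected listing means the catalog is queried 
once per database rather than once per filter.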

Also, isCached comes from SparkSession, not from the same place where catalog 
information is requested.

Our number of tables is not insane (about 20 dbs with at most 200 tables per 
db, and some dbs with only a handful of tables).

Best
Saif


> SparkSession catalog is terribly slow
> -------------------------------------
>
>                 Key: SPARK-21198
>                 URL: https://issues.apache.org/jira/browse/SPARK-21198
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.1.0
>            Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes 
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql() and 
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x, we switched over to SparkSession.catalog instead, 
> but it turns out that both listDatabases() and listTables() take between 5 and 
> 20 minutes, depending on the database, to return results, using operations 
> such as the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> which made the program unbearably slow to return a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am 
> assuming this is going to be deprecated anytime soon?


