[ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063406#comment-16063406 ]
Saif Addin commented on SPARK-21198:
------------------------------------

Okay, I think there is something odd somewhere in between. It may be hard to pin down, but I'll go through it line by line. (spark-submit, local[8])

1. This line gets stuck forever; the program does not continue even after waiting 2 minutes (the Spark task is stuck in collect()):

{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.filter(_ != blacklistdb).map(_.name).collect)
{code}

2. Changing that line to the following (note the filter after collect) takes 8 ms:

{code:java}
import spark.implicits._
private val databases: Array[String] = REFERENCEDB +: (spark.catalog.listDatabases.map(_.name).collect.filterNot(_ == blacklistdb))
{code}

3. The following line takes 2 ms instead:

{code:java}
private val databases: Array[String] = (REFERENCEDB +: spark.sql("show databases").collect.map(_.getString(0))).filterNot(_ == blacklistdb)
{code}

Here is the weirdest part: if I instead start spark-shell and run item 1, it works (and takes 1 ms, which is even faster than in my program; the other lines also drop to 1 ms).

> SparkSession catalog is terribly slow
> -------------------------------------
>
>                 Key: SPARK-21198
>                 URL: https://issues.apache.org/jira/browse/SPARK-21198
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that checks Hive data availability.
> In Spark 1.x, we used sqlContext.tableNames, sqlContext.sql() and sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x, we switched to SparkSession.catalog instead, but it turns out that both listDatabases() and listTables() take between 5 and 20 minutes, depending on the database, to return results, using operations such as the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> This made the program unbearably slow at returning a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I assume this is going to be deprecated soon?
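For anyone trying to reproduce the timing difference outside of spark-shell, here is a minimal, self-contained sketch comparing the catalog-based and SQL-based listing paths from items 2 and 3 above. It is an illustration under assumptions, not the original program: it assumes a Hive-enabled local SparkSession (the spark-hive dependency on the classpath and a reachable metastore), and REFERENCEDB / blacklistdb are hypothetical placeholders for the actual database names.

{code:java}
import org.apache.spark.sql.SparkSession

object CatalogListingTiming {
  def main(args: Array[String]): Unit = {
    // Assumed setup: a local Hive metastore is reachable; enableHiveSupport()
    // makes both the catalog API and "show databases" go through Hive.
    val spark = SparkSession.builder()
      .appName("catalog-listing-timing")
      .master("local[8]")
      .enableHiveSupport()
      .getOrCreate()
    import spark.implicits._ // Encoder[String] for Dataset.map(_.name)

    // Hypothetical stand-ins for the REFERENCEDB and blacklistdb used in the report.
    val REFERENCEDB = "reference_db"
    val blacklistdb = "blacklisted_db"

    // Tiny helper: prints the wall-clock time of a block.
    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label took ${(System.nanoTime() - start) / 1e6}%.1f ms")
      result
    }

    // Item 2 style: list databases via the catalog API, filter after collect.
    val viaCatalog: Array[String] = timed("spark.catalog.listDatabases") {
      REFERENCEDB +: spark.catalog.listDatabases().map(_.name).collect().filterNot(_ == blacklistdb)
    }

    // Item 3 style: list databases via plain SQL.
    val viaSql: Array[String] = timed("show databases") {
      (REFERENCEDB +: spark.sql("show databases").collect().map(_.getString(0))).filterNot(_ == blacklistdb)
    }

    println(viaCatalog.mkString(", "))
    println(viaSql.mkString(", "))
    spark.stop()
  }
}
{code}

Run with spark-submit against the same metastore, the two printed labels should make it easy to see whether the catalog-based path or the SQL-based path is the slow one in a given environment.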