[ https://issues.apache.org/jira/browse/SPARK-21198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063431#comment-16063431 ]
Saif Addin edited comment on SPARK-21198 at 6/26/17 5:23 PM:
-------------------------------------------------------------

Regarding listTables, here is the code used inside the program:

{code:java}
import java.util.Date

println(s"processing all tables for every db. db length is: ${databases.tail.length}")
for (d <- databases.tail) {
  val d1 = new Date().getTime
  val dbs = spark.sqlContext.tables(d).filter("isTemporary = false").select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " + ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
  val d2 = new Date().getTime
  val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " + ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs2.length} tables")
  // ... other stuff
{code}

and the timings are as follows:

{code:java}
processing all tables for every db. db length is: 30
Processed tables in DB using sqlContext. Time: 0.863 seconds. 19 tables
Processed tables in DB using catalog. Time: {color:red}6.978 seconds{color}. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 607 tables
Processed tables in DB using catalog. Time: {color:red}194.501 seconds{color}. 607 tables
Processed tables in DB using sqlContext. Time: 0.243 seconds. 55 tables
Processed tables in DB using catalog. Time: {color:red}17.907 seconds{color}. 55 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 13 tables
Processed tables in DB using catalog. Time: {color:red}4.642 seconds{color}. 13 tables
Processed tables in DB using sqlContext. Time: 0.238 seconds. 392 tables
Processed tables in DB using catalog. Time: {color:red}126.999 seconds{color}. 392 tables
... goes on...
{code}

In a stand-alone spark-shell, as opposed to the full program run through spark-submit:

{code:java}
import java.util.Date

val dbs = spark.catalog.listDatabases.map(_.name).collect
for (d <- dbs) {
  val d1 = new Date().getTime
  val dbs = spark.sqlContext.tables(d).filter("isTemporary = false").select("tableName").collect.map(_.getString(0))
  println("Processed tables in DB using sqlContext. Time: " + ((new Date().getTime - d1) / 1000.0) + s" seconds. ${dbs.length} tables")
  val d2 = new Date().getTime
  val dbs2 = spark.catalog.listTables(d).filter(!_.isTemporary).map(_.name).collect
  println("Processed tables in DB using catalog. Time: " + ((new Date().getTime - d2) / 1000.0) + s" seconds. ${dbs2.length} tables")
}
{code}

{code:java}
Processed tables in DB using sqlContext. Time: 0.59 seconds. 19 tables
Processed tables in DB using catalog. Time: {color:red}6.285 seconds{color}. 19 tables
Processed tables in DB using sqlContext. Time: 0.276 seconds. 608 tables
Processed tables in DB using catalog. Time: {color:red}201.295 seconds{color}. 608 tables
Processed tables in DB using sqlContext. Time: 0.241 seconds. 55 tables
... goes on; timings are similar
{code}

So, apart from the weird listDatabases issue, listTables is consistently slow.
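As a workaround sketch (my assumption being that catalog.listTables is slow because it fetches full metadata for every table, while sqlContext.tables does a single listing per database; I have not verified this against the 2.1 source), the same listing can be issued as one SQL statement per database. listTablesFast is a made-up helper name:

{code:java}
// Sketch only: relies on SHOW TABLES exposing "tableName" and "isTemporary"
// columns, which holds on Spark 2.1; verify the schema on other versions.
def listTablesFast(db: String): Array[String] =
  spark.sql(s"SHOW TABLES IN `$db`")
    .filter("isTemporary = false")
    .select("tableName")
    .collect
    .map(_.getString(0))
{code}

If the assumption holds, its timings should match the sqlContext.tables column above rather than the catalog column.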
> SparkSession catalog is terribly slow
> -------------------------------------
>
>                 Key: SPARK-21198
>                 URL: https://issues.apache.org/jira/browse/SPARK-21198
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Saif Addin
>
> We have a considerably large Hive metastore and a Spark program that goes
> through Hive data availability.
> In Spark 1.x, we were using sqlContext.tableNames, sqlContext.sql(), and
> sqlContext.isCached() to go through Hive metastore information.
> Once migrated to Spark 2.x, we switched over to SparkSession.catalog instead,
> but it turns out that both listDatabases() and listTables() take between 5
> and 20 minutes, depending on the database, to return results, using
> operations such as the following one:
> spark.catalog.listTables(db).filter(_.isTemporary).map(_.name).collect
> This made the program unbearably slow at returning a list of tables.
> I know we still have spark.sqlContext.tableNames as a workaround, but I am
> assuming this is going to be deprecated anytime soon?
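For the workarounds named in the report, a hedged sketch of the calls that remain available on Spark 2.x without going through catalog.listTables ("some_db" and "some_table" are placeholder names, not from the report):

{code:java}
// SQLContext is still reachable from the session in 2.x. tableNames appears
// to do a single catalog listing per database rather than a per-table
// metadata fetch (an assumption based on the timings above, not a source audit).
val names: Array[String] = spark.sqlContext.tableNames("some_db")

// Per-table cache check, the 2.x counterpart of sqlContext.isCached.
val cached: Boolean = spark.catalog.isCached("some_db.some_table")
{code}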