[ https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yana Kadiyska updated SPARK-6984:
---------------------------------

Description:

I have a table with _many_ partitions (30K). Users cannot query all of them at once, but all of them are in the metastore. Querying this table is extremely slow even when asking for a single partition. "describe sometable" also performs _very_ poorly.

{quote}
Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189

Whereas Hive over the same metastore shows:
Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236
{quote}

I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching a screenshot of the stack). Should this value be lazy? "describe table" should be purely a metastore operation IMO (i.e. query Postgres, return the column types).

The issue is a blocker for me, but I am leaving it at default priority until someone can confirm it is a bug. "describe table" by itself is not so interesting, but I believe this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html

> Operations on tables with many partitions _very_ slow
> -----------------------------------------------------
>
>                 Key: SPARK-6984
>                 URL: https://issues.apache.org/jira/browse/SPARK-6984
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.1
>        Environment: External Hive metastore, table with 30K partitions
>           Reporter: Yana Kadiyska
>        Attachments: 7282_partitions_stack.png
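The fix direction the report suggests can be sketched in a few lines. This is a hypothetical illustration, not Spark's actual code: the names `CachedTable`, `fetchPartitions`, and `describeSchema` are invented for the example. It shows how deferring partition enumeration behind a `lazy val` lets a metadata-only operation like "describe table" skip the per-partition construction cost entirely.

```scala
// Hypothetical sketch -- names are illustrative, not Spark's internals.
// Idea: make the partition list lazy so "describe table" never touches it.

case class Partition(spec: Map[String, String])

class CachedTable(fetchPartitions: () => Seq[Partition]) {
  var metastoreCalls = 0 // instrumentation for this example only

  // Reportedly computed eagerly today; `lazy val` defers the 30K-partition
  // fetch until a query actually needs to prune or scan partitions,
  // and memoizes the result so the cost is paid at most once.
  lazy val partitions: Seq[Partition] = {
    metastoreCalls += 1
    fetchPartitions()
  }

  // Pure metadata lookup: column names and types only.
  def describeSchema: Seq[String] = Seq("key string", "value int")
}

object LazyPartitionsDemo extends App {
  val table = new CachedTable(() =>
    (1 to 30000).map(i => Partition(Map("ds" -> i.toString))))

  table.describeSchema               // no partition fetch happens here
  assert(table.metastoreCalls == 0)

  table.partitions                   // first real use triggers one fetch
  table.partitions                   // memoized; no second round-trip
  assert(table.metastoreCalls == 1)
}
```

With the eager version, the constructor itself would pay the 30K-object cost, which matches the attached stack trace; with the lazy version, only partition-reading query paths do.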
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)