[ https://issues.apache.org/jira/browse/SPARK-6984?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yana Kadiyska updated SPARK-6984:
---------------------------------

Description:

I have a table with _many_ partitions (30K). Users cannot query all of them at once, but all of them are in the metastore. Querying this table is extremely slow even when asking for a single partition. "describe sometable" also performs _very_ poorly.

{quote}
Spark produces the following times:
Query 1 of 1, Rows read: 50, Elapsed time (seconds) - Total: 73.02, SQL query: 72.831, Reading results: 0.189

Whereas Hive over the same metastore shows:
Query 1 of 1, Rows read: 47, Elapsed time (seconds) - Total: 0.44, SQL query: 0.204, Reading results: 0.236
{quote}

I attempted to debug this and noticed that HiveMetastoreCatalog constructs an object for each partition, which is puzzling to me (attaching a screenshot of the stack). Should this value be lazy? "describe table" should be purely a metastore operation IMO (i.e. query Postgres, return the column types).

The issue is a blocker for me, but I am leaving it at default priority until someone can confirm it is a bug. "describe table" by itself is not so interesting, but I believe this affects all query paths -- I sent an inquiry earlier here: https://www.mail-archive.com/user@spark.apache.org/msg26242.html

> Operations on tables with many partitions _very_ slow
> -----------------------------------------------------
>
>                 Key: SPARK-6984
>                 URL: https://issues.apache.org/jira/browse/SPARK-6984
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.1
>        Environment: External Hive metastore, table with 30K partitions
>           Reporter: Yana Kadiyska
>        Attachments: 7282_partitions_stack.png
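The fix direction the report suggests can be sketched in a few lines. This is a hypothetical illustration, not Spark's actual code: the names `CachedTable`, `fetchPartitions`, and `describeSchema` are invented for the example. It shows how deferring partition enumeration behind a `lazy val` lets a metadata-only operation like "describe table" skip the per-partition construction cost entirely.

```scala
// Hypothetical sketch -- names are illustrative, not Spark's internals.
// Idea: make the partition list lazy so "describe table" never touches it.

case class Partition(spec: Map[String, String])

class CachedTable(fetchPartitions: () => Seq[Partition]) {
  var metastoreCalls = 0 // instrumentation for this example only

  // Reportedly computed eagerly today; `lazy val` defers the 30K-partition
  // fetch until a query actually needs to prune or scan partitions,
  // and memoizes the result so the cost is paid at most once.
  lazy val partitions: Seq[Partition] = {
    metastoreCalls += 1
    fetchPartitions()
  }

  // Pure metadata lookup: column names and types only.
  def describeSchema: Seq[String] = Seq("key string", "value int")
}

object LazyPartitionsDemo extends App {
  val table = new CachedTable(() =>
    (1 to 30000).map(i => Partition(Map("ds" -> i.toString))))

  table.describeSchema               // no partition fetch happens here
  assert(table.metastoreCalls == 0)

  table.partitions                   // first real use triggers one fetch
  table.partitions                   // memoized; no second round-trip
  assert(table.metastoreCalls == 1)
}
```

With the eager version, the constructor itself would pay the 30K-object cost, which matches the attached stack trace; with the lazy version, only partition-reading query paths do.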
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)