[jira] [Commented] (SPARK-6910) Support for pushing predicates down to metastore for partition pruning

2016-04-11 Thread Mick Davies (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-6910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15235370#comment-15235370 ]

Mick Davies commented on SPARK-6910:


Hi, 

We are seeing something similar, but in our case subsequent queries are still 
expensive. Looking at HiveMetastoreCatalog.lookupRelation (we are using 1.5, 
but 1.6 looks the same), a new MetastoreRelation is created for each query. 
The analysis phase then tries to convert it to a ParquetRelation via 
convertToParquetRelation, which always calls 
metastoreRelation.getHiveQlPartitions() and so fetches all partition 
information. Every query therefore incurs the cost of retrieving all 
partition metadata.

We don't see how the code can make effective use of cachedDataSourceTables 
in the circumstances just described.
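
To make the problem concrete, here is a heavily simplified, self-contained 
sketch of the lookup logic as we read it (a plain Map stands in for the Guava 
cache; the types and method bodies are illustrative only, not the actual 
Spark 1.5 code):

{code}
import scala.collection.mutable

object LookupSketch extends App {
  // Illustrative stand-ins; not Spark's real types or logic.
  case class QualifiedTableName(database: String, name: String) {
    def toLowerCase: QualifiedTableName =
      QualifiedTableName(database.toLowerCase, name.toLowerCase)
  }
  case class Relation(partitionCount: Int)

  val cachedDataSourceTables = mutable.Map.empty[QualifiedTableName, Relation]

  def getHiveQlPartitions(): Int = {
    println("fetching ALL partition metadata from the metastore")
    10000
  }

  def lookupRelation(db: String, tbl: String, providerPropertySet: Boolean): Relation =
    if (providerPropertySet) {
      // Data source tables are served from the cache after the first query.
      cachedDataSourceTables.getOrElseUpdate(
        QualifiedTableName(db, tbl).toLowerCase, Relation(getHiveQlPartitions()))
    } else {
      // A plain Hive table gets a fresh MetastoreRelation every time, and the
      // analyzer's conversion refetches every partition, query after query.
      Relation(getHiveQlPartitions())
    }

  // Two identical queries against a plain Hive table hit the metastore twice.
  lookupRelation("db", "person", providerPropertySet = false)
  lookupRelation("db", "person", providerPropertySet = false)
}
{code}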

We changed HiveMetastoreCatalog.lookupRelation to use the cache even when the 
Hive table property "spark.sql.sources.provider" is unset, which causes 
subsequent queries to use the cached relation and therefore run much more 
quickly.

E.g., we changed 
{code}
if (table.properties.get("spark.sql.sources.provider").isDefined) 
{code}

to 
{code}
if (cachedDataSourceTables.getIfPresent(
      QualifiedTableName(databaseName, tblName).toLowerCase) != null ||
    table.properties.get("spark.sql.sources.provider").isDefined)
{code}
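
In other words: reuse a cached relation whenever one exists, even for a table 
that lacks the data source provider property; the behaviour for data source 
tables is unchanged.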

Are we doing something wrong?




> Support for pushing predicates down to metastore for partition pruning
> --
>
> Key: SPARK-6910
> URL: https://issues.apache.org/jira/browse/SPARK-6910
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Michael Armbrust
>Assignee: Cheolsoo Park
>Priority: Critical
> Fix For: 1.5.0
>
>






[jira] [Created] (SPARK-8077) Optimisation of TreeNode for large number of children

2015-06-03 Thread Mick Davies (JIRA)
Mick Davies created SPARK-8077:
--

 Summary: Optimisation of TreeNode for large number of children
 Key: SPARK-8077
 URL: https://issues.apache.org/jira/browse/SPARK-8077
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 1.3.1
Reporter: Mick Davies
Priority: Minor


Large IN clauses are parsed very slowly. For example, the SQL below (10K items 
in the IN list) takes 45-50s.

{code}
s"SELECT * FROM Person WHERE ForeName IN ('${(1 to 10000).map("n" + _).mkString("','")}')"
{code}

This is principally due to TreeNode, which repeatedly calls contains on 
children; here children is a List 10K elements long, so parsing large IN 
clauses is effectively O(N^2).

A small change that uses a lazily initialised Set built from children for the 
contains checks reduces parse time to around 2.5s; a sketch of the idea 
follows.
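
A minimal, self-contained sketch of the idea, using our own simplified 
stand-in for TreeNode rather than the actual Catalyst code:

{code}
object TreeNodeSketch extends App {
  // Simplified stand-in for Catalyst's TreeNode; illustrative only.
  class Node(val children: List[Node]) {
    // Built once on first use; each membership test drops from O(N) to ~O(1).
    lazy val childSet: Set[Node] = children.toSet

    def containsChildFast(n: Node): Boolean = childSet.contains(n)
    def containsChildSlow(n: Node): Boolean = children.contains(n) // old O(N) scan
  }

  val kids = List.fill(10000)(new Node(Nil))
  val root = new Node(kids)

  def time[A](f: => A): Double = {
    val t0 = System.nanoTime(); f; (System.nanoTime() - t0) / 1e6
  }

  // One membership test per child: ~10K * 10K comparisons vs 10K hash lookups.
  val slow = time(kids.foreach(root.containsChildSlow))
  val fast = time(kids.foreach(root.containsChildFast))
  println(f"List.contains: $slow%.1f ms, Set.contains: $fast%.1f ms")
}
{code}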

I'd like to create a PR for this change, as we often use IN clauses with a few 
thousand items.



