[jira] [Updated] (SPARK-17179) Consider improving partition pruning in HiveMetastoreCatalog

Steve Loughran (JIRA) Mon, 22 Aug 2016 01:37:33 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-17179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Steve Loughran updated SPARK-17179:
-----------------------------------
    Affects Version/s: 2.0.0
             Priority: Major  (was: Critical)
          Description: 
Issue:
- Create an external table with 1000s of partition
- Running simple query with partition details ends up listing all files for 
caching in ListingFileCatalog.  This would turn out to be very slow in cloud 
based FS access (e.g S3). Even though, ListingFileCatalog supports
multi-threading, it would end up unncessarily listing 1000+ files when user is 
just interested in 1 partition.
- This adds up additional overhead in HiveMetastoreCatalog as it queries all 
partitions in convertToLogicalRelation 
(metastoreRelation.getHiveQlPartitions()).  Partition related details
are not passed in here, so ends up overloading hive metastore.
- Also even if any partition changes, cache would be dirtied and have to be 
re-populated.  It would be nice to prune the partitions in metastore layer 
itself, so that few partitions are looked up via FileSystem and only few items 
are cached.

{noformat}
"CREATE EXTERNAL TABLE `ca_par_ext`(
  `customer_id` bigint,
  `account_id` bigint)
PARTITIONED BY (
  `effective_date` date)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://bucket_details/ca_par'"

explain select count(*) from ca_par_ext where effective_date between 
'2015-12-17' and '2015-12-18';

{noformat}

  was:

Issue:
- Create an external table with 1000s of partition
- Running simple query with partition details ends up listing all files for 
caching in ListingFileCatalog.  This would turn out to be very slow in cloud 
based FS access (e.g S3). Even though, ListingFileCatalog supports
multi-threading, it would end up unncessarily listing 1000+ files when user is 
just interested in 1 partition.
- This adds up additional overhead in HiveMetastoreCatalog as it queries all 
partitions in convertToLogicalRelation 
(metastoreRelation.getHiveQlPartitions()).  Partition related details
are not passed in here, so ends up overloading hive metastore.
- Also even if any partition changes, cache would be dirtied and have to be 
re-populated.  It would be nice to prune the partitions in metastore layer 
itself, so that few partitions are looked up via FileSystem and only few items 
are cached.

{noformat}
"CREATE EXTERNAL TABLE `ca_par_ext`(
  `customer_id` bigint,
  `account_id` bigint)
PARTITIONED BY (
  `effective_date` date)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3a://bucket_details/ca_par'"

explain select count(*) from ca_par_ext where effective_date between 
'2015-12-17' and '2015-12-18';

{noformat}


> Consider improving partition pruning in HiveMetastoreCatalog
> ------------------------------------------------------------
>
>                 Key: SPARK-17179
>                 URL: https://issues.apache.org/jira/browse/SPARK-17179
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.0.0
>            Reporter: Rajesh Balamohan
>
> Issue:
> - Create an external table with 1000s of partition
> - Running simple query with partition details ends up listing all files for 
> caching in ListingFileCatalog.  This would turn out to be very slow in cloud 
> based FS access (e.g S3). Even though, ListingFileCatalog supports
> multi-threading, it would end up unncessarily listing 1000+ files when user 
> is just interested in 1 partition.
> - This adds up additional overhead in HiveMetastoreCatalog as it queries all 
> partitions in convertToLogicalRelation 
> (metastoreRelation.getHiveQlPartitions()).  Partition related details
> are not passed in here, so ends up overloading hive metastore.
> - Also even if any partition changes, cache would be dirtied and have to be 
> re-populated.  It would be nice to prune the partitions in metastore layer 
> itself, so that few partitions are looked up via FileSystem and only few 
> items are cached.
> {noformat}
> "CREATE EXTERNAL TABLE `ca_par_ext`(
>   `customer_id` bigint,
>   `account_id` bigint)
> PARTITIONED BY (
>   `effective_date` date)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
> LOCATION
>   's3a://bucket_details/ca_par'"
> explain select count(*) from ca_par_ext where effective_date between 
> '2015-12-17' and '2015-12-18';
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

[jira] [Updated] (SPARK-17179) Consider improving partition pruning in HiveMetastoreCatalog

Reply via email to