[ 
https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-24974.
----------------------------------
    Resolution: Incomplete

> Spark put all file's paths into SharedInMemoryCache even for unused 
> partitions.
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-24974
>                 URL: https://issues.apache.org/jira/browse/SPARK-24974
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: andrzej.stankev...@gmail.com
>            Priority: Major
>              Labels: bulk-closed
>
> SharedInMemoryCache holds the file status of every file, whether or not you 
> specify partition columns. This causes long load times for queries that use 
> only a couple of partitions, because Spark loads the paths of files from all 
> partitions.
> I partitioned the files by *report_date* and *type*, and I have a directory 
> structure like 
> {code:java}
> /custom_path/report_date=2018-07-24/type=A/file_1.parquet
> {code}
>  
> I am trying to execute 
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter("type == 'A'").count
> {code}
>  
> My query only needs to load files of type *A*, which is just a couple of 
> files. But Spark loads all 19K files from all partitions into 
> SharedInMemoryCache, which takes about 60 seconds, and only after that 
> discards the unused partitions. 
>  
> This could be related to [https://jira.apache.org/jira/browse/SPARK-17994] 
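>  
> A possible workaround, given only as a sketch: pass the *type=A* directory 
> to the reader directly, and set the documented *basePath* option so that 
> partition discovery still exposes *report_date* and *type* as columns. This 
> assumes the listing cost is driven by the directory handed to the reader; 
> whether it avoids the 60-second load on 2.2.1 has not been verified here. 
> The application name below is illustrative only.
> {code:scala}
> import org.apache.spark.sql.SparkSession
>
> // In spark-shell, `spark` is already defined and this line can be skipped.
> // Outside the shell this assumes the job is launched via spark-submit.
> val spark = SparkSession.builder().appName("partition-listing-sketch").getOrCreate()
>
> // Read only the type=A directory so that file listing starts there,
> // instead of at the report_date directory that contains all types.
> // basePath marks where the partitioned layout begins, so report_date
> // and type are still available as partition columns.
> val count = spark.read
>   .option("basePath", "/custom_path")
>   .parquet("/custom_path/report_date=2018-07-24/type=A")
>   .count()
> {code}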



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
