[ https://issues.apache.org/jira/browse/SPARK-24974?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-24974.
----------------------------------
    Resolution: Incomplete

> Spark put all file's paths into SharedInMemoryCache even for unused partitions.
> -------------------------------------------------------------------------------
>
>                 Key: SPARK-24974
>                 URL: https://issues.apache.org/jira/browse/SPARK-24974
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.1
>            Reporter: andrzej.stankev...@gmail.com
>            Priority: Major
>              Labels: bulk-closed
>
> SharedInMemoryCache holds the FileStatus of every file, whether or not you
> specify partition columns. This causes long load times for queries that use
> only a couple of partitions, because Spark loads the paths of files from all
> partitions.
> I partitioned the files by *report_date* and *type*, and I have a directory
> structure like
> {code:java}
> /custom_path/report_date=2018-07-24/type=A/file_1.parquet
> {code}
>
> I am trying to execute
> {code:java}
> val count = spark.read.parquet("/custom_path/report_date=2018-07-24").filter("type == 'A'").count
> {code}
>
> In my query I need to load only the files of type *A*, which is just a couple
> of files. But Spark loads all 19K files from all partitions into
> SharedInMemoryCache, which takes about 60 seconds, and only after that
> discards the unused partitions.
>
> This could be related to [https://jira.apache.org/jira/browse/SPARK-17994]
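
A possible workaround for the slow listing described in the report, offered as a sketch and not as something proposed in the issue itself: pass the single partition directory to the reader, so that only that directory should need to be listed, and set the `basePath` option so that partition discovery still exposes *report_date* and *type* as columns. The paths are the ones from the report; the object name and application name are hypothetical, and the behavior has not been checked against the affected version (2.2.1).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical object name, used only for this illustration.
object PartitionPathWorkaround {
  def main(args: Array[String]): Unit = {
    // Assumes the job is launched with spark-submit, which supplies the master URL.
    val spark = SparkSession.builder()
      .appName("partition-path-workaround") // hypothetical application name
      .getOrCreate()

    // Read only the type=A directory instead of the whole report_date directory.
    // "basePath" marks the table root, so report_date and type are still
    // inferred as partition columns from the directory names.
    val count = spark.read
      .option("basePath", "/custom_path")
      .parquet("/custom_path/report_date=2018-07-24/type=A")
      .count()

    println(s"rows in type=A: $count")
    spark.stop()
  }
}
```

The trade-off is that the partition filter moves from the query into the input path, so this only helps when the wanted partitions are known before the read.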