[ https://issues.apache.org/jira/browse/SPARK-30024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985768#comment-16985768 ]
Hyukjin Kwon commented on SPARK-30024:
--------------------------------------

[~lotkowskim] Let's start the discussion on the mailing list first before filing a JIRA issue. Such discussions are better held on the mailing list, where they can get more attention from Spark developers.

> Support subdirectories when accessing partitioned Parquet Hive table
> --------------------------------------------------------------------
>
> Key: SPARK-30024
> URL: https://issues.apache.org/jira/browse/SPARK-30024
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4
> Reporter: Michael Lotkowski
> Priority: Major
>
> Hi all,
>
> We have run into issues when trying to read a partitioned Parquet table
> created by Hive. I think I have narrowed the cause down to how
> [InMemoryFileIndex|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L95]
> builds its parent -> file mapping.
>
> The folder structure created by Hive is as follows:
> s3://bucket/table/date=2019-11-25/subdir1/data1.parquet
> s3://bucket/table/date=2019-11-25/subdir2/data2.parquet
>
> Looking through the code, InMemoryFileIndex maps each leaf file to its
> immediate parent directory, yielding the following mapping:
> val leafDirToChildrenFiles = Map(
>   s3://bucket/table/date=2019-11-25/subdir1 ->
>     s3://bucket/table/date=2019-11-25/subdir1/data1.parquet,
>   s3://bucket/table/date=2019-11-25/subdir2 ->
>     s3://bucket/table/date=2019-11-25/subdir2/data2.parquet
> )
>
> This mapping is then used in
> [PartitioningAwareFileIndex|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L83]
> to prune the partitions.
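The mapping and the failing lookup described above can be modeled with a small, self-contained Scala sketch. This uses plain strings and a standard Map rather than Spark's actual Hadoop `Path`-based types, and `filesForPartition` is an illustrative stand-in for the pruning lookup, not Spark's real code:

```scala
// Illustrative model of the mapping InMemoryFileIndex builds for the
// Hive-created layout above. Plain Scala, not Spark's actual classes.
object LeafDirMapping {
  val leafDirToChildrenFiles: Map[String, Seq[String]] = Map(
    "s3://bucket/table/date=2019-11-25/subdir1" ->
      Seq("s3://bucket/table/date=2019-11-25/subdir1/data1.parquet"),
    "s3://bucket/table/date=2019-11-25/subdir2" ->
      Seq("s3://bucket/table/date=2019-11-25/subdir2/data2.parquet")
  )

  // Partition pruning looks up the partition directory itself, which here
  // is the *grandparent* of the data files, so the lookup comes back empty.
  def filesForPartition(partitionPath: String): Seq[String] =
    leafDirToChildrenFiles.getOrElse(partitionPath, Seq.empty)
}
```

Looking up the partition path `s3://bucket/table/date=2019-11-25` in this map finds nothing, while looking up either subdirectory succeeds, which is the mismatch the issue describes.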
> From my understanding, pruning works by looking up the partition path in
> leafDirToChildrenFiles, which in this case is
> s3://bucket/table/date=2019-11-25/, and therefore it fails to find any
> files for this partition.
>
> My suggested fix is to change how InMemoryFileIndex builds the mapping:
> instead of mapping each file's parent dir to the file, map the rootPath
> to the file. More concretely:
> https://gist.github.com/lotkowskim/76e8ff265493efd0b2b7175446805a82
>
> I have tested this by updating the jar running on EMR, and we can now
> correctly read the data from these partitioned tables. It's also worth
> noting that we can read the data, without any modifications to the code,
> if we use the following settings:
>
> "spark.sql.hive.convertMetastoreParquet" to "false",
> "spark.hive.mapred.supports.subdirectories" to "true",
> "spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive" to "true"
>
> However, with these settings we lose the ability to prune partitions,
> causing us to read the entire table every time, as we aren't using a
> Spark relation.
>
> I want to start a discussion on whether this is the correct change, or
> whether we are missing something more obvious. In either case I would be
> happy to fully implement the change.
>
> Thanks,
> Michael

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
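For illustration, the suggested rootPath-style mapping could be sketched as below. This is plain Scala independent of Spark, and `partitionDir` is a hypothetical helper for this example (it keys each file by its last `key=value` path segment), not the actual change in the linked gist:

```scala
// Hypothetical sketch of the suggested fix: key the mapping by the
// partition directory (the path pruning actually looks up) rather than
// by the immediate parent of each leaf file.
object RootPathMapping {
  val leafFiles = Seq(
    "s3://bucket/table/date=2019-11-25/subdir1/data1.parquet",
    "s3://bucket/table/date=2019-11-25/subdir2/data2.parquet"
  )

  // Truncate the path after the last `key=value` segment, so files in
  // nested subdirectories still map to their enclosing partition.
  def partitionDir(file: String): String = {
    val segs = file.split('/')
    val idx  = segs.lastIndexWhere(_.contains("="))
    segs.take(idx + 1).mkString("/")
  }

  val partitionDirToFiles: Map[String, Seq[String]] =
    leafFiles.groupBy(partitionDir)
}
```

With this grouping, both data files land under the single key `s3://bucket/table/date=2019-11-25`, so a lookup by partition path would find them and partition pruning would still work.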