[ https://issues.apache.org/jira/browse/SPARK-30024?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985768#comment-16985768 ]

Hyukjin Kwon commented on SPARK-30024:
--------------------------------------

[~lotkowskim] Let's start the discussion on the mailing list first before filing 
a JIRA issue. Such discussions are better held on the mailing list, where they 
can draw more attention from Spark developers.

> Support subdirectories when accessing partitioned Parquet Hive table
> --------------------------------------------------------------------
>
>                 Key: SPARK-30024
>                 URL: https://issues.apache.org/jira/browse/SPARK-30024
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.4
>            Reporter: Michael Lotkowski
>            Priority: Major
>
> Hi all,
>  
> We have run into issues when trying to read a partitioned Parquet table 
> created by Hive. I think I have narrowed down the cause to how 
> [InMemoryFileIndex|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InMemoryFileIndex.scala#L95]
> creates a parent -> file mapping.
>  
> The folder structure created by Hive is as follows:
> s3://bucket/table/date=2019-11-25/subdir1/data1.parquet
> s3://bucket/table/date=2019-11-25/subdir2/data2.parquet
>  
> Looking through the code, it seems that InMemoryFileIndex builds a mapping 
> from each leaf directory to its child files, yielding:
> val leafDirToChildrenFiles = Map(
>   "s3://bucket/table/date=2019-11-25/subdir1" ->
>     "s3://bucket/table/date=2019-11-25/subdir1/data1.parquet",
>   "s3://bucket/table/date=2019-11-25/subdir2" ->
>     "s3://bucket/table/date=2019-11-25/subdir2/data2.parquet"
> )
>  
> This mapping is then used in 
> [PartitioningAwareFileIndex|https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L83]
> to prune partitions. From my understanding, pruning works by looking up the 
> partition path in leafDirToChildrenFiles, which in this case is 
> s3://bucket/table/date=2019-11-25/, so it fails to find any files for this 
> partition.
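> The mismatch can be sketched in plain Scala (illustrative only, not the 
> actual Spark internals; the map below mirrors the layout described above):

```scala
// Sketch of the pruning miss: the index is keyed by the files' immediate
// parent directories (subdir1, subdir2), but the pruner looks up the
// partition directory itself, so the lookup comes back empty.
object PruneMiss {
  val leafDirToChildrenFiles: Map[String, Seq[String]] = Map(
    "s3://bucket/table/date=2019-11-25/subdir1" ->
      Seq("s3://bucket/table/date=2019-11-25/subdir1/data1.parquet"),
    "s3://bucket/table/date=2019-11-25/subdir2" ->
      Seq("s3://bucket/table/date=2019-11-25/subdir2/data2.parquet")
  )

  // Pruning queries the partition directory, not the subdirectories:
  val partitionPath = "s3://bucket/table/date=2019-11-25"
  val files: Seq[String] =
    leafDirToChildrenFiles.getOrElse(partitionPath, Seq.empty)

  def main(args: Array[String]): Unit = {
    assert(files.isEmpty) // no files found for the partition
  }
}
```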
>  
> My suggested fix is to change how InMemoryFileIndex builds the mapping: 
> instead of mapping each file's immediate parent directory to the file, map 
> the directory under rootPath to the file. More concretely: 
> [https://gist.github.com/lotkowskim/76e8ff265493efd0b2b7175446805a82]
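> The direction of the proposed change can be sketched as follows (the helper 
> names are illustrative, not the actual patch in the gist): key each file by 
> the partition directory directly under the table root rather than by its 
> immediate parent, so files in arbitrary subdirectories still resolve under 
> the path the pruner queries.

```scala
// Sketch of the rootPath-based mapping (assumed names, not Spark's API).
object RootPathMapping {
  val rootPath = "s3://bucket/table"

  // First path segment below the root, e.g. .../date=2019-11-25
  def partitionDir(file: String): String = {
    val rel = file.stripPrefix(rootPath + "/")
    rootPath + "/" + rel.split('/').head
  }

  val files = Seq(
    "s3://bucket/table/date=2019-11-25/subdir1/data1.parquet",
    "s3://bucket/table/date=2019-11-25/subdir2/data2.parquet"
  )

  // Both files now group under the partition directory the pruner looks up.
  val dirToFiles: Map[String, Seq[String]] = files.groupBy(partitionDir)

  def main(args: Array[String]): Unit = {
    assert(dirToFiles("s3://bucket/table/date=2019-11-25").size == 2)
  }
}
```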
>  
> I have tested this by updating the jar running on EMR, and we can now 
> correctly read the data from these partitioned tables. It's also worth 
> noting that we can read the data, without any modifications to the code, if 
> we use the following settings:
>  
> "spark.sql.hive.convertMetastoreParquet" to "false",
> "spark.hive.mapred.supports.subdirectories" to "true",
> "spark.hadoop.mapreduce.input.fileinputformat.input.dir.recursive" to "true"
>  
> However, with these settings we lose the ability to prune partitions, 
> causing us to read the entire table every time, since we are no longer using 
> a Spark relation.
>  
> I want to start discussion on whether this is a correct change, or if we are 
> missing something more obvious. In either case I would be happy to fully 
> implement the change.
>  
> Thanks,
> Michael



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
